Machine learning for predicting response scores for drugs

ABSTRACT

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating a respective response score for each of a plurality of patient categories. In one aspect, a method comprises: generating a drug signature for a drug; generating an embedding of the drug signature in a latent space; and processing: (i) the embedding of the drug signature in the latent space, and (ii) data defining a plurality of patient categories, to generate a plurality of response scores, wherein each response score corresponds to a respective patient category and characterizes a predicted response of patients included in the patient category to the drug.

RELATED APPLICATIONS

This application is a continuation of and claims priority to U.S. Patent Application No. 17/960,755, filed Oct. 5, 2022, which claims priority to U.S. Provisional Application No. 63/252,523, filed Oct. 5, 2021; U.S. Provisional Application No. 63/252,539, filed Oct. 5, 2021; U.S. Provisional Application No. 63/252,562, filed Oct. 5, 2021; U.S. Provisional Application No. 63/252,500, filed Oct. 5, 2021; U.S. Provisional Application No. 63/292,115, filed Dec. 21, 2021; U.S. Provisional Application No. 63/294,751, filed Dec. 29, 2021; U.S. Provisional Application No. 63/328,189, filed Apr. 6, 2022; U.S. Provisional Application No. 63/337,753, filed May 3, 2022; U.S. Provisional Application No. 63/400,250, filed Aug. 23, 2022; and U.S. Provisional Application No. 63/413,150, filed Oct. 4, 2022, all of which are incorporated herein by reference.

BACKGROUND

This specification relates to processing data using machine learning models.

Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.

Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a nonlinear transformation to a received input to generate an output.

SUMMARY

This specification describes a machine learning system implemented as computer programs on one or more computers in one or more locations for processing multi-modal data characterizing patients.

Throughout this specification, a data “modality” refers to a type of data, e.g., that is generated using a specified sensor or medical diagnostic technique, and “multi-modal” data refers to a collection of data from multiple different modalities. An “embedding” refers to an ordered collection of numerical values, e.g., a vector, matrix, or other tensor of numerical values. The term “patient” is used interchangeably with the term “subject.”

Throughout this specification, a first set of elements is referred to as a “proper” subset of a second set of elements if: (i) the first set is a subset of the second set, and (ii) the first set includes fewer than all of the elements in the second set.

According to a first aspect, there is provided a method comprising: receiving multi-modal data characterizing a patient, wherein the multi-modal data comprises a respective feature representation for each of a plurality of modalities; processing the multi-modal data characterizing the patient using an encoder neural network to generate an embedding of the multi-modal data characterizing the patient; determining a respective classification score for each patient category in a set of patient categories based on the embedding of the multi-modal data characterizing the patient; and classifying the patient as being included in a corresponding patient category from the set of patient categories based on the classification scores.

In some implementations, the method further comprises, prior to receiving the multi-modal data characterizing the patient, determining the set of patient categories based on a set of training patients; wherein determining the set of patient categories based on the set of training patients comprises: receiving respective multi-modal data characterizing each training patient in the set of training patients; generating a set of training embeddings that each correspond to a respective training patient, wherein generating the training embedding for a training patient comprises processing the multi-modal data characterizing the training patient using the encoder neural network; and determining a partition of the set of training embeddings into a set of clusters of training embeddings, wherein each cluster of training embeddings comprises a plurality of training embeddings and represents a respective patient category.

In some implementations, determining the partition of the set of training embeddings into the set of clusters of training embeddings comprises: applying a clustering operation to the set of training embeddings to partition the set of training embeddings into the set of clusters of training embeddings.

In some implementations, each training patient is identified as being included in a patient category represented by the cluster of training embeddings that includes the training embedding for the training patient.

In some implementations, the method further comprises: determining, based on the classification of the patient as being included in the corresponding patient category, that the patient should receive a particular medical treatment.

In some implementations, the method further comprises applying the particular medical treatment to the patient in response to determining that the patient should receive the particular medical treatment.

In some implementations, the patient category that includes the patient also includes a plurality of training patients, wherein each training patient is associated with a respective class from a set of classes, and determining that the patient should receive the particular medical treatment comprises: determining a class distribution for the patient category that defines, for each class in the set of classes, a respective fraction of training patients included in the patient category that are associated with the class; and determining that the patient should receive the particular medical treatment based on the class distribution for the patient category.
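As a purely illustrative sketch of the class distribution described above, the following Python computes the fraction of training patients per class within one patient category. The class labels, the `should_treat` helper, and the 0.5 cutoff are hypothetical and are not specified by this description.

```python
from collections import Counter


def class_distribution(training_classes):
    """Return, for each class, the fraction of training patients in the category."""
    counts = Counter(training_classes)
    total = len(training_classes)
    return {cls: n / total for cls, n in counts.items()}


def should_treat(distribution, min_responder_fraction=0.5):
    # Hypothetical decision rule: recommend the treatment when the fraction of
    # training patients labelled "responder" meets an assumed cutoff.
    return distribution.get("responder", 0.0) >= min_responder_fraction


# Classes of the training patients that fall in the same category as the patient.
category_classes = ["responder", "responder", "non_responder", "responder"]
dist = class_distribution(category_classes)
recommend = should_treat(dist)
```

Any other rule that maps the class distribution to a treatment decision could be substituted for the threshold used here.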

In some implementations, the set of classes includes a first class and a second class, wherein each training patient associated with the first class is classified as having responded to the particular medical treatment when the particular medical treatment was applied to the training patient, and wherein each training patient associated with the second class is classified as having not responded to the particular medical treatment when the particular medical treatment was applied to the training patient.

In some implementations, determining the respective classification score for each patient category in the set of patient categories based on the embedding of the multi-modal data characterizing the patient comprises: processing the embedding of the multi-modal data characterizing the patient using a classification machine learning model to generate the respective classification score for each patient category in the set of patient categories; wherein the classification machine learning model has been trained on a set of training examples, wherein each training example corresponds to a respective training patient from the set of training patients and comprises: (i) the training embedding for the training patient, and (ii) a label identifying a respective patient category of the training patient.

In some implementations, determining the respective classification score for each patient category in the set of patient categories based on the embedding of the multi-modal data characterizing the patient comprises, for each patient category: determining a centroid embedding for the patient category as a combination of each training embedding in the cluster of training embeddings represented by the patient category; and determining the classification score for the patient category based on a similarity measure between: (i) the embedding of the multi-modal data characterizing the patient, and (ii) the centroid embedding for the patient category.
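The centroid-based scoring described above can be sketched as follows. This is a minimal illustration that assumes the "combination" is an arithmetic mean and the similarity measure is negative Euclidean distance; neither choice is mandated by this description, and the toy clusters are invented for the example.

```python
import numpy as np


def centroid_scores(patient_embedding, clusters):
    """Score each patient category by similarity to its centroid embedding.

    clusters: list of arrays, each of shape (num_training_embeddings, dim).
    """
    scores = []
    for cluster in clusters:
        centroid = cluster.mean(axis=0)  # assumed combination: mean
        distance = np.linalg.norm(patient_embedding - centroid)
        scores.append(-distance)  # assumed similarity: negative distance
    return np.array(scores)


clusters = [
    np.array([[0.0, 0.0], [0.0, 2.0]]),      # category 0
    np.array([[10.0, 10.0], [12.0, 10.0]]),  # category 1
]
patient = np.array([1.0, 1.0])
scores = centroid_scores(patient, clusters)
# Classify the patient into the category with the highest score.
predicted_category = int(np.argmax(scores))
```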

In some implementations, the encoder neural network includes a respective encoder subnetwork corresponding to each modality of the plurality of modalities, and processing the multi-modal data characterizing the patient using the encoder neural network to generate the embedding of the multi-modal data characterizing the patient comprises: processing, for each of the plurality of modalities, the respective feature representation for the modality using the corresponding encoder subnetwork of the encoder neural network to generate a respective encoder subnetwork output; and combining the respective encoder subnetwork output of each encoder subnetwork to generate the embedding of the multi-modal data characterizing the patient.

In some implementations, the encoder neural network has been jointly trained with a decoder neural network, wherein jointly training the encoder neural network and the decoder neural network comprises, for each training patient in a set of training patients: processing training multi-modal data characterizing the training patient using the encoder neural network, in accordance with current values of a set of encoder neural network parameters, to generate an embedding of the training multi-modal data characterizing the training patient; processing the embedding of the training multi-modal data characterizing the training patient using the decoder neural network, in accordance with current values of a set of decoder neural network parameters, to generate a reconstruction of the training multi-modal data characterizing the training patient; and updating the current values of the set of encoder neural network parameters and the current values of the set of decoder neural network parameters using gradients of a reconstruction loss function that measures an error in the reconstruction of the training multi-modal data characterizing the training patient.

In some implementations, the reconstruction loss function comprises a plurality of scaling factors that each scale a respective term in the reconstruction loss function that measures an error in the reconstruction of a corresponding proper subset of a set of feature dimensions of the training multi-modal data characterizing the training patient; and each of the plurality of scaling factors has a respective value that is based on a relevance of the corresponding proper subset of the set of feature dimensions of the training multi-modal data to a medical condition.
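A scaled reconstruction loss of this kind might look like the sketch below. It assumes a mean squared error per subset of feature dimensions; the error measure, the grouping of dimensions, and the scaling factor values are all illustrative assumptions rather than details taken from this description.

```python
import numpy as np


def weighted_reconstruction_loss(x, x_hat, groups, scaling_factors):
    """Sum of per-group reconstruction errors, each scaled by its own factor.

    groups: list of index arrays, each a proper subset of the feature dimensions.
    scaling_factors: one non-negative weight per group.
    """
    loss = 0.0
    for idx, weight in zip(groups, scaling_factors):
        group_error = np.mean((x[idx] - x_hat[idx]) ** 2)  # assumed: MSE
        loss += weight * group_error
    return float(loss)


x = np.array([1.0, 2.0, 3.0, 4.0])      # original multi-modal features
x_hat = np.array([1.0, 2.0, 2.0, 2.0])  # reconstruction from the decoder
groups = [np.array([0, 1]), np.array([2, 3])]
# Hypothetical factors: the first group is treated as more relevant.
loss = weighted_reconstruction_loss(x, x_hat, groups, [2.0, 0.5])
```

In training, gradients of this loss with respect to the encoder and decoder parameters would drive the parameter updates; the sketch only evaluates the loss value.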

In some implementations, the respective value of each of the plurality of scaling factors is based on a relevance of the corresponding proper subset of the set of feature dimensions to a treatment for the medical condition.

In some implementations, for one or more feature dimensions, the reconstruction loss comprises a respective scaling factor corresponding to the feature dimension and a value of the scaling factor corresponding to the feature dimension is determined by operations comprising: obtaining, for each of one or more reference patients: (i) a pre-treatment value of a feature corresponding to the feature dimension that is measured for the reference patient prior to the reference patient receiving the treatment, and (ii) a post-treatment value of the feature corresponding to the feature dimension that is measured for the reference patient after the reference patient receives the treatment; and determining the value of the scaling factor corresponding to the feature dimension based on, for each reference patient, the pre-treatment value and the post-treatment value corresponding to the feature dimension for the reference patient.

In some implementations, classifying the patient as being included in a corresponding patient category from the set of patient categories based on the classification scores comprises: classifying the patient as being included in a patient category with a highest classification score.

In some implementations, the plurality of modalities include a functional magnetic resonance imaging (fMRI) modality, and the feature representation for the fMRI modality is derived from a series of fMRI images that each correspond to a respective time point in a sequence of time points and characterize blood flow in a brain of the patient at the time point.

In some implementations, the plurality of modalities include a clinical scale modality, and the feature representation of the clinical scale modality represents data obtained from a clinical interview with the patient.

In some implementations, the plurality of modalities include an electroencephalography (EEG) modality, and the feature representation of the EEG modality is derived from a plurality of voltage waveforms that are each measured by a respective electrode placed in proximity to a brain of the patient.

In some implementations, the plurality of modalities include a genomics modality, and the feature representation of the genomics modality is derived from data defining a sequence of nucleotides from a genome of the patient.

In some implementations, the plurality of modalities include an audio modality, and the feature representation of the audio modality is derived from audio data that represents a sequence of words spoken by the patient.

According to another aspect, there is provided a method comprising: obtaining a plurality of training examples, wherein each training example corresponds to a respective patient and includes multi-modal data, having a plurality of feature dimensions, that characterizes the patient; jointly training an encoder neural network having a set of encoder parameters and a decoder neural network having a set of decoder parameters on the plurality of training examples, comprising, for each training example: processing the multi-modal data from the training example using the encoder neural network, in accordance with current values of the encoder parameters, to generate an embedding of the multi-modal data from the training example; processing the embedding of the multi-modal data from the training example using the decoder neural network, in accordance with current values of the decoder parameters, to generate a reconstruction of the multi-modal data from the training example; and updating the current values of the set of encoder parameters and the current values of the set of decoder parameters using gradients of a reconstruction loss function that measures an error in the reconstruction of the multi-modal data from the training example, wherein: the reconstruction loss function comprises a plurality of scaling factors that each scale a respective term in the reconstruction loss function that measures an error in the reconstruction of a corresponding proper subset of the feature dimensions of the multi-modal data from the training example, and each of the plurality of scaling factors has a respective value that is based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data from the training example to a particular medical condition.

In some implementations, the respective value of each of the plurality of scaling factors is based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data from the training example to a treatment for the particular medical condition.

In some implementations, for one or more feature dimensions, the reconstruction loss comprises a respective scaling factor corresponding to the feature dimension and a value of the scaling factor corresponding to the feature dimension is determined by operations comprising: obtaining, for each of one or more reference patients: (i) a pre-treatment value of a feature corresponding to the feature dimension that is measured for the reference patient prior to the reference patient receiving the treatment, and (ii) a post-treatment value of the feature corresponding to the feature dimension that is measured for the reference patient after the reference patient receives the treatment; and determining the value of the scaling factor corresponding to the feature dimension based on, for each reference patient, the pre-treatment value and the post-treatment value corresponding to the feature dimension for the reference patient.

In some implementations, determining the value of the scaling factor corresponding to the feature dimension based on, for each reference patient, the pre-treatment value and the post-treatment value corresponding to the feature dimension for the reference patient comprises: determining a set of difference values, wherein each difference value represents a difference between the pre-treatment value and the post-treatment value corresponding to the feature dimension for a respective reference patient; determining a measure of central tendency of the set of difference values; and determining the value of the scaling factor corresponding to the feature dimension based on the measure of central tendency of the set of difference values.

In some implementations, determining a measure of central tendency of the set of difference values comprises: determining a mean or median of the set of difference values.

In some implementations, determining the value of the scaling factor corresponding to the feature dimension based on the measure of central tendency of the set of difference values comprises: determining the value of the scaling factor corresponding to the feature dimension based on an absolute value of the measure of central tendency of the set of difference values.
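The pre-treatment and post-treatment procedure above can be sketched in a few lines. The example assumes the scaling factor is set equal to the absolute value of the central tendency; the description only requires that the factor be based on that quantity, so any further normalization is left open. The measurement values are invented.

```python
import numpy as np


def scaling_factor_from_treatment(pre_values, post_values, use_median=False):
    """Scaling factor for one feature dimension from reference patients.

    Each position in pre_values and post_values belongs to the same patient.
    """
    diffs = np.asarray(post_values, dtype=float) - np.asarray(pre_values, dtype=float)
    # Measure of central tendency: mean or median of the difference values.
    center = np.median(diffs) if use_median else np.mean(diffs)
    # Assumed mapping: use the absolute value directly as the scaling factor.
    return float(abs(center))


pre = [10.0, 12.0, 11.0]   # hypothetical pre-treatment measurements
post = [7.0, 10.0, 10.0]   # hypothetical post-treatment measurements
factor = scaling_factor_from_treatment(pre, post)
```

Taking the absolute value means a feature that consistently decreases under treatment is weighted the same as one that consistently increases by the same amount.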

In some implementations, the treatment for the particular medical condition comprises a drug, and for each of one or more of the scaling factors: the proper subset of the feature dimensions corresponding to the scaling factor characterizes a brain region in a brain parcellation; and a value of the scaling factor is determined based on a positron emission tomography (PET) image of a brain of a reference patient that is captured after the drug has been labelled with a radiotracer and administered to the reference patient.

In some implementations, determining the value of the scaling factor based on the PET image of the brain of the reference patient comprises: determining a penetration score for the brain region that characterizes a concentration of the drug in the brain region in the brain of the reference patient based on a measure of central tendency of intensities of voxels included in the brain region in the PET image of the brain of the reference patient; and determining the value of the scaling factor based on the penetration score for the brain region.
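For illustration only, a penetration score could be computed as the mean voxel intensity inside a region mask, as sketched below. The tiny 2x2x2 volume, the integer-labelled parcellation array, and the choice of the mean as the measure of central tendency are assumptions made for the example.

```python
import numpy as np


def penetration_score(pet_image, region_mask):
    """Mean intensity of the PET voxels that belong to one brain region."""
    return float(np.mean(pet_image[region_mask]))


# Toy PET volume: the first slab has high tracer uptake, the second has none.
pet = np.zeros((2, 2, 2))
pet[0] = 4.0

# Hypothetical parcellation: integer region label per voxel.
parcellation = np.zeros((2, 2, 2), dtype=int)
parcellation[0] = 1

score_region_1 = penetration_score(pet, parcellation == 1)
score_region_0 = penetration_score(pet, parcellation == 0)
```

A scaling factor for the feature dimensions of each region could then be derived from its score, for example by normalizing the scores across regions; that mapping is not fixed by this description.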

In some implementations, for one or more feature dimensions, the reconstruction loss comprises a respective scaling factor corresponding to the feature dimension and a value of the scaling factor corresponding to the feature dimension is determined by operations comprising: obtaining, for each reference patient in a set of reference patients: (i) a value of a feature corresponding to the feature dimension that is measured for the reference patient, and (ii) a label indicating whether the reference patient has been diagnosed with the medical condition; determining a correlation between values of the feature corresponding to the feature dimension and diagnosis with the medical condition; and determining the scaling factor corresponding to the feature dimension based on the correlation between values of the feature corresponding to the feature dimension and diagnosis with the medical condition.

In some implementations, the treatment for the particular medical condition involves administering a drug to treat the particular medical condition.

In some implementations, scaling factors corresponding to proper subsets of the feature dimensions of the multi-modal data that are more relevant to the particular medical condition have higher values than scaling factors corresponding to proper subsets of the feature dimensions of the multi-modal data that are less relevant to the particular medical condition.

In some implementations, the particular medical condition is a psychiatric medical condition.

In some implementations, for each training example, processing the multi-modal data from the training example using the encoder neural network to generate the embedding of the multi-modal data from the training example comprises: processing the multi-modal data from the training example to generate parameters defining a posterior probability distribution over a latent space; and sampling the embedding of the multi-modal data from the training example in accordance with the posterior probability distribution over the latent space.
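One common way to realize the sampling step above is sketched below. It assumes the posterior is a Gaussian with diagonal covariance, parameterized by a mean and a log-variance vector; this description does not restrict the form of the distribution, so the parameterization is an assumption.

```python
import numpy as np


def sample_embedding(mean, log_var, rng):
    """Draw an embedding from N(mean, diag(exp(log_var)))."""
    std = np.exp(0.5 * log_var)
    noise = rng.standard_normal(mean.shape)
    # Writing the sample as mean + std * noise keeps it differentiable with
    # respect to the distribution parameters in an autodiff framework.
    return mean + std * noise


rng = np.random.default_rng(0)
mean = np.array([0.5, -1.0, 2.0])  # hypothetical encoder outputs
log_var = np.full(3, -50.0)        # near-zero variance for the demo
z = sample_embedding(mean, log_var, rng)
```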

In some implementations, the multi-modal data from each training example comprises a respective feature representation for each of a plurality of modalities, the encoder neural network includes a respective encoder subnetwork corresponding to each modality of the plurality of modalities, and for each training example, processing the multi-modal data from the training example to generate the posterior probability distribution over the latent space comprises: processing, for each of the plurality of modalities, the respective feature representation for the modality using the corresponding encoder subnetwork of the encoder neural network to generate a respective encoder subnetwork output; and combining the respective encoder subnetwork output of each encoder subnetwork to generate the parameters defining the posterior probability distribution over the latent space.

In some implementations, the decoder neural network includes a respective decoder subnetwork corresponding to each modality of the plurality of modalities, and for each training example, processing the embedding of the multi-modal data from the training example to generate the reconstruction of the multi-modal data from the training example comprises: processing, for each of the plurality of modalities, the embedding of the multi-modal data from the training example using the corresponding decoder subnetwork of the decoder neural network to generate a reconstruction of the feature representation for the modality.

In some implementations, for each training example, the multi-modal data from the training example comprises a respective feature representation for each of a plurality of modalities.

In some implementations, the plurality of modalities include a functional magnetic resonance imaging (fMRI) modality, and the feature representation for the fMRI modality is derived from a series of fMRI images that each correspond to a respective time point in a sequence of time points and characterize blood flow in a brain of the patient at the time point.

In some implementations, the plurality of modalities include a clinical scale modality, and the feature representation of the clinical scale modality represents data obtained from a clinical interview with the patient.

According to another aspect, there is provided a method comprising: obtaining a plurality of training examples, wherein each training example corresponds to a respective patient and includes multi-modal data, having a plurality of feature dimensions, that characterizes the patient; jointly training an encoder neural network and a decoder neural network on the plurality of training examples, wherein: the encoder neural network is configured to process input multi-modal data characterizing an input patient to generate an embedding of the input multi-modal data in a multidimensional latent space; and the decoder neural network is configured to process the embedding of the input multi-modal data to generate a reconstruction of the input multi-modal data; and generating a plurality of multi-modal data archetypes that each correspond to a respective dimension of the latent space, comprising, for each multi-modal data archetype: processing a predefined embedding that represents the corresponding dimension of the latent space using the decoder neural network to generate multi-modal data, having the plurality of feature dimensions, that defines the multi-modal data archetype.

In some implementations, the method further comprises generating a respective representation of each of the plurality of multi-modal data archetypes, comprising, for each multi-modal data archetype: generating a respective intensity score for each of the plurality of feature dimensions of the multi-modal data archetype based on: (i) a value of the feature dimension of the multi-modal data archetype, and (ii) a distribution defined by values of the feature dimension of multi-modal data included in the plurality of training examples; wherein the representation of the multi-modal data archetype comprises the respective intensity score for each of the plurality of feature dimensions of the multi-modal data archetype.

In some implementations, for each of the plurality of feature dimensions of the multi-modal data archetype, the intensity score for the feature dimension characterizes a likelihood of the value of the feature dimension of the multi-modal data archetype under the distribution defined by values of the feature dimension of multi-modal data included in the plurality of training examples.

In some implementations, for each of the plurality of feature dimensions of the multi-modal data archetype, determining the intensity score for the feature dimension comprises: determining a mean and a standard deviation of the distribution defined by values of the feature dimension of multi-modal data included in the plurality of training examples; and determining the intensity score for the feature dimension using the mean and the standard deviation of the distribution defined by values of the feature dimension of multi-modal data included in the plurality of training examples.
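One plausible reading of an intensity score based on the mean and standard deviation is a z-score, sketched below. The z-score formula is an assumption (the description says only that the mean and standard deviation are used), and the sketch presumes every feature dimension has a non-zero standard deviation in the training data.

```python
import numpy as np


def intensity_scores(archetype, training_data):
    """Per-feature z-score of an archetype relative to the training data.

    archetype: array of shape (num_features,).
    training_data: array of shape (num_examples, num_features).
    """
    mean = training_data.mean(axis=0)
    std = training_data.std(axis=0)  # assumed non-zero for every feature
    return (archetype - mean) / std


training_data = np.array([
    [1.0, 10.0],
    [3.0, 10.0],
    [1.0, 14.0],
    [3.0, 14.0],
])
archetype = np.array([4.0, 12.0])
z_scores = intensity_scores(archetype, training_data)
```

Under this reading, a large positive or negative score marks a feature whose archetype value is unusual compared with the training population, and a score near zero marks a typical value.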

In some implementations, the method further comprises providing the representations of the multi-modal data archetypes as explainability data that explains the dimensions of the latent space.

In some implementations, the predefined embeddings that represent the dimensions of the latent space define a set of basis embeddings for the latent space, wherein each latent embedding in the latent space can be expressed as a linear combination of the set of basis embeddings for the latent space.

In some implementations, for each dimension of the latent space, the predefined embedding representing the dimension of the latent space is a unit embedding having: (i) a non-zero value in the dimension, and (ii) a zero value in each other dimension.
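The archetype generation described above can be illustrated by decoding one unit embedding per latent dimension. The linear decoder below is a stand-in for a trained decoder neural network, and its weights are invented for the example; the non-zero entry of each unit embedding is assumed to be 1.

```python
import numpy as np


def generate_archetypes(decoder, latent_dim):
    """Decode one unit embedding per latent dimension into an archetype."""
    basis = np.eye(latent_dim)  # row d is the unit embedding for dimension d
    return np.stack([decoder(unit) for unit in basis])


# Stand-in decoder: a fixed linear map from 2 latent dims to 3 feature dims.
W = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])
b = np.array([0.5, 0.5, 0.5])


def toy_decoder(z):
    return z @ W + b


archetypes = generate_archetypes(toy_decoder, latent_dim=2)
```

Row `d` of `archetypes` is the multi-modal data that the decoder associates with latent dimension `d` alone, which is what makes it usable as explainability data for that dimension.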

In some implementations, jointly training the encoder neural network and the decoder neural network comprises, for each training example: processing the multi-modal data from the training example using the encoder neural network, in accordance with current values of the encoder parameters, to generate an embedding of the multi-modal data from the training example; processing the embedding of the multi-modal data from the training example using the decoder neural network, in accordance with current values of the decoder parameters, to generate a reconstruction of the multi-modal data from the training example; and updating the current values of the set of encoder parameters and the current values of the set of decoder parameters using gradients of a reconstruction loss function that measures an error in the reconstruction of the multi-modal data from the training example, wherein: the reconstruction loss function comprises a plurality of scaling factors that each scale a respective term in the reconstruction loss function that measures an error in the reconstruction of a corresponding proper subset of the feature dimensions of the multi-modal data from the training example, and each of the plurality of scaling factors has a respective value that is based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data from the training example to a particular medical condition.

In some implementations, the respective value of each of the plurality of scaling factors is based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data from the training example to diagnosing the particular medical condition.

In some implementations, the respective value of each of the plurality of scaling factors is based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data from the training example to a treatment for the particular medical condition.

In some implementations, scaling factors corresponding to proper subsets of the feature dimensions of the multi-modal data that are more relevant to the particular medical condition have higher values than scaling factors corresponding to proper subsets of the feature dimensions of the multi-modal data that are less relevant to the particular medical condition.

In some implementations, jointly training the encoder neural network and the decoder neural network comprises, for each latent dimension in a proper subset of a plurality of latent dimensions of the latent space: obtaining multi-modal data that defines a target multi-modal data archetype, having a plurality of feature dimensions, that corresponds to the latent dimension; processing a predefined embedding that represents the latent dimension using the decoder neural network to generate multi-modal data, having the plurality of feature dimensions, that defines a predicted multi-modal data archetype corresponding to the latent dimension; and updating the values of the set of decoder parameters using gradients of a loss function that measures an error between: (i) the predicted multi-modal data archetype corresponding to the latent dimension, and (ii) the target multi-modal data archetype corresponding to the latent dimension.

In some implementations, for each training example, the multi-modal data from the training example comprises a respective feature representation for each of a plurality of modalities.

In some implementations, the plurality of modalities include a functional magnetic resonance imaging (fMRI) modality, and the feature representation for the fMRI modality is derived from a series of fMRI images that each correspond to a respective time point in a sequence of time points and characterize blood flow in a brain of the patient at the time point.

In some implementations, the plurality of modalities include an electroencephalography (EEG) modality, and the feature representation of the EEG modality is derived from a plurality of voltage waveforms that are each measured by a respective electrode placed in proximity to a brain of the patient.

In some implementations, the plurality of modalities include a genomics modality, and the feature representation of the genomics modality is derived from data defining a sequence of nucleotides from a genome of the patient.

In some implementations, the method further comprises: processing the plurality of multi-modal data archetypes to identify one or more dimensions to be removed from the latent space; and removing the identified dimensions from the latent space.

In some implementations, processing the plurality of multi-modal data archetypes to identify one or more dimensions to be removed from the latent space comprises, for each multi-modal data archetype: determining whether a value of a feature dimension of the multi-modal data archetype satisfies a threshold; and determining whether a corresponding dimension of the latent space should be removed based at least in part on whether the value of the feature dimension of the multi-modal data archetype satisfies the threshold.
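The threshold test above might be sketched as follows. The direction of the test (removing a latent dimension when the magnitude of the archetype value for a chosen feature falls below the threshold) is an assumption, as are the feature index, the threshold value, and the toy data.

```python
import numpy as np


def dimensions_to_remove(archetypes, feature_index, threshold):
    """Latent dims whose archetype value for one feature is below the threshold.

    archetypes: array of shape (latent_dim, num_features); row d is the
    archetype for latent dimension d.
    """
    values = np.abs(archetypes[:, feature_index])
    return [int(d) for d in np.flatnonzero(values < threshold)]


archetypes = np.array([
    [0.90, 0.10],
    [0.01, 0.20],
    [0.50, 0.30],
])
removed = dimensions_to_remove(archetypes, feature_index=0, threshold=0.05)

# Drop the identified dimensions from a batch of latent embeddings.
kept = [d for d in range(archetypes.shape[0]) if d not in removed]
embeddings = np.arange(6.0).reshape(2, 3)
pruned_embeddings = embeddings[:, kept]
```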

According to another aspect, there is provided a method comprising:obtaining a plurality of training examples, wherein each trainingexample corresponds to a respective patient and includes multi-modaldata that characterizes the patient; and jointly training an encoderneural network and a decoder neural network on the plurality of trainingexamples, wherein: the encoder neural network is configured to processinput multi-modal data characterizing an input patient, in accordancewith values of a set of encoder parameters, to generate an embedding ofthe input multi-modal data in a latent space having a plurality oflatent dimensions; and the decoder neural network is configured toprocess the embedding of the input multi-modal data, in accordance withvalues of a set of decoder parameters, to generate a reconstruction ofthe input multi-modal data; wherein the training comprises, for eachlatent dimension in a proper subset of the plurality of latentdimensions of the latent space: obtaining multi-modal data that definesa target multi-modal data archetype, having a plurality of featuredimensions, that corresponds to the latent dimension; processing apredefined embedding that represents the latent dimension using thedecoder neural network to generate multi-modal data, having theplurality of feature dimensions, that defines a predicted multi-modaldata archetype corresponding to the latent dimension; and updating thevalues of the set of decoder parameters using gradients of an archetypeloss function that measures an error between: (i) the predictedmulti-modal data archetype corresponding to the latent dimension, and(ii) the target multi-modal data archetype corresponding to the latentdimension.

In some implementations, for each latent dimension in the proper subset of the plurality of latent dimensions of the latent space, the predefined embedding that represents the latent dimension is a basis embedding from a set of basis embeddings that define a basis of the latent space, wherein each latent embedding in the latent space can be expressed as a linear combination of the set of basis embeddings.

In some implementations, for each latent dimension in the proper subset of the plurality of latent dimensions of the latent space, the predefined embedding that represents the latent dimension is a unit embedding having: (i) a non-zero value in the latent dimension, and (ii) a zero value in each other dimension.
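As a minimal sketch of the unit embeddings described above, the following Python builds one such embedding per latent dimension and shows that an arbitrary latent embedding can be written as a linear combination of them. The function names `unit_embedding` and `combine`, the three-dimensional latent space, and the non-zero value of 1.0 are illustrative assumptions and are not specified by this description.

```python
def unit_embedding(latent_dim, index, value=1.0):
    """Return an embedding that is `value` in one latent dimension and zero elsewhere."""
    if not 0 <= index < latent_dim:
        raise ValueError("index must identify a latent dimension")
    return [value if i == index else 0.0 for i in range(latent_dim)]


def combine(coefficients, basis):
    """Return the linear combination of the basis embeddings with the given coefficients."""
    dim = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coefficients, basis)) for i in range(dim)]


# Hypothetical 3-dimensional latent space: one unit embedding per latent dimension.
basis = [unit_embedding(3, i) for i in range(3)]

# Any latent embedding is recovered from its coordinates in this basis.
embedding = combine([0.5, -2.0, 4.0], basis)
```

With a non-zero value of 1.0, the unit embeddings form the standard basis, so the coefficients of the linear combination are simply the coordinates of the embedding.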

In some implementations, the training further comprises, for each latent dimension in the proper subset of the plurality of latent dimensions of the latent space: processing the multi-modal data that defines the target multi-modal data archetype corresponding to the latent dimension using the encoder neural network to generate an embedding of the target multi-modal data archetype corresponding to the latent dimension; and updating the values of the set of encoder parameters using gradients of the archetype loss, wherein the archetype loss further measures an error between: (i) the embedding of the target multi-modal data archetype corresponding to the latent dimension, and (ii) the predefined embedding that represents the latent dimension.

In some implementations, obtaining the target multi-modal data archetypes comprises, prior to training the decoder neural network using the archetype loss function: jointly training the encoder neural network and the decoder neural network on the plurality of training examples over one or more initial training iterations to optimize an objective function that excludes the archetype loss function; processing, for each of the plurality of latent dimensions of the latent space, a predefined embedding that represents the latent dimension using the decoder neural network to generate multi-modal data that defines a candidate multi-modal data archetype corresponding to the latent dimension; and identifying one or more of the candidate multi-modal data archetypes as being target multi-modal data archetypes.

In some implementations, identifying one or more of the candidate multi-modal data archetypes as being target multi-modal data archetypes comprises: providing, to a user, a respective representation of each candidate multi-modal data archetype; and receiving, from the user, data selecting one or more of the candidate multi-modal data archetypes as target multi-modal data archetypes.

In some implementations, for each latent dimension in the proper subset of the plurality of latent dimensions of the latent space, the archetype loss function comprises a plurality of scaling factors that each scale a respective term in the archetype loss function that measures an error between: (i) the predicted multi-modal data archetype corresponding to the latent dimension, and (ii) the target multi-modal data archetype corresponding to the latent dimension, along a corresponding proper subset of the feature dimensions.
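One possible form of such a scaled archetype loss is sketched below in Python. The use of a squared error for each term, the function name `archetype_loss`, and the example subsets and scaling factors are assumptions made for illustration; the description above does not fix the error measure.

```python
def archetype_loss(predicted, target, subsets, scaling_factors):
    """Sum of per-subset squared errors, each multiplied by its scaling factor.

    `subsets` is a list of lists of feature-dimension indices, and
    `scaling_factors` holds one factor per subset (both hypothetical inputs).
    """
    loss = 0.0
    for indices, scale in zip(subsets, scaling_factors):
        # Error between predicted and target archetype along this subset only.
        term = sum((predicted[i] - target[i]) ** 2 for i in indices)
        loss += scale * term
    return loss


predicted_archetype = [1.0, 2.0, 3.0, 4.0]
target_archetype = [1.0, 1.0, 1.0, 2.0]
feature_subsets = [[0, 1], [2, 3]]   # e.g., one subset per modality (assumed)
factors = [2.0, 0.5]                 # larger factor = subset weighted more heavily

loss = archetype_loss(predicted_archetype, target_archetype, feature_subsets, factors)
```

In this sketch the first subset contributes 2.0 x 1.0 and the second contributes 0.5 x 8.0, so a large error on a low-weight subset can still be outweighed by the scaling.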

In some implementations, each of the plurality of scaling factors has a respective value that is based on a relevance of the corresponding proper subset of the feature dimensions to a particular medical condition.

In some implementations, the respective value of each of the plurality of scaling factors is based on a relevance of the corresponding proper subset of the feature dimensions to a treatment for the particular medical condition.

In some implementations, for one or more feature dimensions, the reconstruction loss comprises a respective scaling factor corresponding to the feature dimension, and a value of the scaling factor corresponding to the feature dimension is determined by operations comprising: obtaining, for each of one or more reference patients: (i) a pre-treatment value of a feature corresponding to the feature dimension that is measured for the reference patient prior to the reference patient receiving the treatment, and (ii) a post-treatment value of the feature corresponding to the feature dimension that is measured for the reference patient after the reference patient receives the treatment; and determining the value of the scaling factor corresponding to the feature dimension based on, for each reference patient, the pre-treatment value and the post-treatment value corresponding to the feature dimension for the reference patient.

In some implementations, determining the value of the scaling factor corresponding to the feature dimension based on, for each reference patient, the pre-treatment value and the post-treatment value corresponding to the feature dimension for the reference patient comprises: determining a set of difference values, wherein each difference value represents a difference between the pre-treatment value and the post-treatment value corresponding to the feature dimension for a respective reference patient; determining a measure of central tendency of the set of difference values; and determining the value of the scaling factor corresponding to the feature dimension based on the measure of central tendency of the set of difference values.
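The operations above can be sketched in Python as follows. Taking the absolute value of each difference, using the mean as the measure of central tendency, and using that mean directly as the scaling factor are all assumptions chosen for illustration; a median or another mapping could equally be used.

```python
from statistics import mean


def scaling_factor(pre_values, post_values):
    """Scaling factor for one feature dimension from reference-patient measurements.

    Assumption: the factor is the mean absolute pre/post-treatment difference.
    """
    if len(pre_values) != len(post_values) or not pre_values:
        raise ValueError("need one pre- and one post-treatment value per reference patient")
    # One difference value per reference patient.
    differences = [abs(post - pre) for pre, post in zip(pre_values, post_values)]
    # Measure of central tendency of the set of difference values.
    return mean(differences)


# Hypothetical measurements of one feature for three reference patients.
pre_treatment = [10.0, 12.0, 8.0]
post_treatment = [7.0, 9.0, 8.0]

factor = scaling_factor(pre_treatment, post_treatment)
```

Under these assumptions, a feature that changes substantially after treatment receives a larger scaling factor than one that is unaffected.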

In some implementations, scaling factors corresponding to proper subsets of the feature dimensions that are more relevant to the particular medical condition have higher values than scaling factors corresponding to proper subsets of the feature dimensions that are less relevant to the particular medical condition.

In some implementations, for each training example, the multi-modal data from the training example comprises a respective feature representation for each of a plurality of modalities.

In some implementations, the plurality of modalities include a functional magnetic resonance imaging (fMRI) modality, and the feature representation for the fMRI modality is derived from a series of fMRI images that each correspond to a respective time point in a sequence of time points and characterize blood flow in a brain of the patient at the time point.

In some implementations, the plurality of modalities include a clinical scale modality, and the feature representation of the clinical scale modality represents data obtained from a clinical interview with the patient.

In some implementations, the plurality of modalities include an electroencephalography (EEG) modality, and the feature representation of the EEG modality is derived from a plurality of voltage waveforms that are each measured by a respective electrode placed in proximity to a brain of the patient.

In some implementations, the plurality of modalities include a genomics modality, and the feature representation of the genomics modality is derived from data defining a sequence of nucleotides from a genome of the patient.

In some implementations, the plurality of modalities include an audio modality, and the feature representation of the audio modality is derived from audio data that represents a sequence of words spoken by the patient.

According to another aspect, there is provided a method performed by one or more computers, the method comprising: receiving multi-modal data characterizing a target subject; generating conditioning data for conditioning the multi-modal data characterizing the target subject based on a population of reference subjects, comprising: receiving, for each reference subject in the population of reference subjects, a feature representation of the reference subject corresponding to a reference modality and having a plurality of feature dimensions; and generating the conditioning data based on the feature representations of the reference subjects; applying the conditioning data to the multi-modal data characterizing the target subject; and after applying the conditioning data to the multi-modal data characterizing the target subject, processing the multi-modal data characterizing the target subject using a machine learning model to generate a machine learning model output for the target subject.

In some implementations, generating the conditioning data based on the feature representations of the reference subjects comprises: determining, for each pair of feature dimensions comprising a first feature dimension and a second feature dimension from the plurality of feature dimensions, a respective correlation coefficient for the pair of feature dimensions that measures a correlation between: (i) a value of the first feature dimension in the feature representations of the reference subjects, and (ii) a value of the second feature dimension in the feature representations of the reference subjects; and generating the conditioning data based on the correlation coefficients.
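A minimal Python sketch of the pairwise correlation computation follows. The choice of the Pearson correlation coefficient, the helper names `pearson` and `correlation_matrix`, and the convention of returning 0.0 for a constant feature dimension are assumptions; the description above does not name a particular correlation coefficient.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumed convention for a constant feature dimension
    return cov / (norm_x * norm_y)


def correlation_matrix(representations):
    """Correlation coefficient for every pair of feature dimensions.

    `representations` holds one feature representation per reference subject.
    """
    num_dims = len(representations[0])
    columns = [[rep[d] for rep in representations] for d in range(num_dims)]
    return [[pearson(columns[i], columns[j]) for j in range(num_dims)]
            for i in range(num_dims)]


# Hypothetical feature representations of three reference subjects.
reference_representations = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 2.0],
    [3.0, 6.0, 1.0],
]
conditioning = correlation_matrix(reference_representations)
```

The resulting square matrix is one example of conditioning data that could later be applied to the data characterizing a target subject.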

In some implementations, for each reference subject in the population of reference subjects: the plurality of feature dimensions in the feature representation of the reference subject comprise a respective feature dimension corresponding to each protein in a predefined set of proteins; and the value of each feature dimension corresponding to a protein defines an expression level of the protein in the reference subject.

In some implementations, for each reference subject in the population of reference subjects: the plurality of feature dimensions in the feature representation of the reference subject comprise a respective feature dimension corresponding to each gene in a predefined set of genes; and the value of each feature dimension corresponding to a gene defines an expression level of the gene in the reference subject.

In some implementations, the method further comprises receiving, for each reference subject in the population of reference subjects, a label that defines: (i) whether the reference subject has a particular medical condition, or (ii) whether the reference subject has responded to a treatment for a particular medical condition; wherein generating the conditioning data based on the feature representations of the reference subjects comprises: determining, for each feature dimension from the plurality of feature dimensions, a respective correlation coefficient that measures a correlation between: (i) a value of the feature dimension in the feature representations of the reference subjects, and (ii) the labels of the reference subjects; and generating the conditioning data based on the correlation coefficients.

In some implementations, for each reference subject in the population of reference subjects, receiving a feature representation of the reference subject corresponding to a reference modality comprises: receiving a pre-treatment feature representation of the reference subject captured before a medical treatment is applied to the reference subject; and receiving a post-treatment feature representation of the reference subject captured after the medical treatment is applied to the reference subject.

In some implementations, generating the conditioning data based on the feature representations of the reference subjects comprises: generating, for each reference subject, a differential feature representation of the reference subject as a difference between: (i) the pre-treatment feature representation of the reference subject, and (ii) the post-treatment feature representation of the reference subject; and generating the conditioning data as a combination of the differential feature representations of the reference subjects.

In some implementations, generating the conditioning data as a combination of the differential feature representations of the reference subjects comprises: generating the conditioning data as an average of the differential feature representations of the reference subjects.
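The averaging of differential feature representations can be sketched in Python as below. The sign convention (post-treatment minus pre-treatment) and the function name `average_differential` are assumptions; the description above only requires a difference between the two representations.

```python
def average_differential(pre_representations, post_representations):
    """Average, over reference subjects, of per-subject differential representations.

    Assumption: each differential is post-treatment minus pre-treatment.
    """
    differentials = [
        [post - pre for pre, post in zip(pre_rep, post_rep)]
        for pre_rep, post_rep in zip(pre_representations, post_representations)
    ]
    num_subjects = len(differentials)
    # Average each feature dimension across the reference subjects.
    return [sum(column) / num_subjects for column in zip(*differentials)]


# Hypothetical two-dimensional representations for two reference subjects.
pre = [[1.0, 2.0], [3.0, 4.0]]
post = [[2.0, 2.0], [5.0, 3.0]]

conditioning_data = average_differential(pre, post)
```

The result has one value per feature dimension of the reference modality, which is the shape needed for a pointwise application to a target subject.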

In some implementations, the pre-treatment feature representation and the post-treatment feature representation of the reference subject are captured using functional magnetic resonance imaging (fMRI).

In some implementations, the pre-treatment feature representation and the post-treatment feature representation of the reference subject are captured using positron emission tomography (PET) imaging.

In some implementations, applying the conditioning data to the multi-modal data characterizing the target subject comprises: pointwise multiplying each of a plurality of feature dimensions of the multi-modal data by a corresponding dimension of the conditioning data.

In some implementations, the conditioning data is represented as a two-dimensional (2D) matrix of numerical values, and wherein applying the conditioning data to the multi-modal data characterizing the target subject comprises: matrix multiplying a plurality of feature dimensions of the multi-modal data by the 2D matrix of numerical values representing the conditioning data.
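The two ways of applying conditioning data described above, pointwise multiplication and matrix multiplication, can be illustrated with the following Python sketch. Treating the feature dimensions as a row vector that is right-multiplied by the matrix is an assumed convention, and the function names and example values are hypothetical.

```python
def apply_pointwise(features, conditioning):
    """Multiply each feature dimension by the corresponding conditioning value."""
    if len(features) != len(conditioning):
        raise ValueError("conditioning data must have one value per feature dimension")
    return [f * c for f, c in zip(features, conditioning)]


def apply_matrix(features, matrix):
    """Multiply the feature vector (as a row vector) by a 2D conditioning matrix."""
    if len(features) != len(matrix):
        raise ValueError("matrix must have one row per feature dimension")
    num_outputs = len(matrix[0])
    return [sum(features[i] * matrix[i][j] for i in range(len(features)))
            for j in range(num_outputs)]


target_features = [1.0, 2.0]

# Vector-valued conditioning data, e.g., an averaged differential representation.
scaled = apply_pointwise(target_features, [0.5, 2.0])

# Matrix-valued conditioning data, e.g., pairwise correlation coefficients.
mixed = apply_matrix(target_features, [[1.0, 0.5], [0.5, 1.0]])
```

Pointwise application rescales each feature dimension independently, whereas matrix application lets each output dimension draw on every input dimension.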

In some implementations, applying the conditioning data to the multi-modal data characterizing the target subject comprises: applying the conditioning data to a plurality of feature dimensions of the multi-modal data corresponding to a target modality, wherein the target modality is a different modality than the reference modality used to generate the conditioning data.

In some implementations, the machine learning model comprises an encoder neural network, and wherein processing the multi-modal data characterizing the target subject using the machine learning model comprises: processing the multi-modal data characterizing the target subject using the encoder neural network to generate an embedding of the multi-modal data characterizing the target subject; determining a respective classification score for each patient category in a set of patient categories based on the embedding of the multi-modal data characterizing the target subject; and classifying the target subject as being included in a corresponding patient category from the set of patient categories based on the classification scores.

In some implementations, processing the multi-modal data characterizing the target subject using the machine learning model comprises: processing the multi-modal data characterizing the target subject using the machine learning model, in accordance with values of a plurality of machine learning model parameters, to generate a prediction characterizing the target subject.

In some implementations, the prediction characterizing the target subject comprises a prediction for whether the target subject has a particular medical condition.

In some implementations, the multi-modal data characterizing the target subject comprises a respective feature representation for each of a plurality of modalities.

In some implementations, each of the plurality of modalities corresponds to a respective sensor, and wherein the feature representation of each modality is based on data generated by the corresponding sensor.

According to another aspect, there is provided a method comprising: jointly training an encoder neural network having a set of encoder parameters and a decoder neural network having a set of decoder parameters, comprising, at each of a plurality of training iterations: obtaining a batch of training examples, wherein each training example corresponds to a respective subject and includes multi-modal data that characterizes the subject; processing the multi-modal data from each training example using the encoder neural network, in accordance with current values of the encoder parameters, to generate a respective embedding of the multi-modal data from each training example in a latent space; processing the embedding of the multi-modal data from each training example using the decoder neural network, in accordance with current values of the decoder parameters, to generate a respective reconstruction of the multi-modal data from each training example; clustering a set of embeddings into a plurality of clusters of embeddings, wherein each cluster of embeddings includes a plurality of embeddings from the set of embeddings, and wherein the set of embeddings includes the respective embedding of the multi-modal data from each training example; determining a clustering loss based on the clustering of the set of embeddings into the plurality of clusters of embeddings; and updating the current values of the set of encoder parameters and the current values of the set of decoder parameters using gradients of an objective function that depends on: (i) a respective error in the reconstruction of the multi-modal data from each training example, and (ii) the clustering loss.

In some implementations, each embedding in the set of embeddings is associated with a cluster label that identifies a cluster that includes the embedding, and wherein determining the clustering loss based on the clustering of the set of embeddings into the plurality of clusters of embeddings comprises: designating a proper subset of the set of embeddings as being training embeddings; training a classification machine learning model, comprising, for each training embedding, training the classification machine learning model to process the training embedding to predict the cluster label of the training embedding; and after training the classification machine learning model, determining the clustering loss using the classification machine learning model.

In some implementations, determining the clustering loss using the classification machine learning model comprises: designating a proper subset of the set of embeddings as validation embeddings; evaluating a classification accuracy of the classification machine learning model on a task of processing each validation embedding to predict the cluster label of the validation embedding; and determining the clustering loss based on the classification accuracy of the classification machine learning model.

In some implementations, the set of validation embeddings excludes any training embeddings.

In some implementations, updating the current values of the set of encoder parameters using gradients of the objective function that depends on the clustering loss encourages an increase in the classification accuracy of the classification machine learning model.

In some implementations, the classification machine learning model comprises a neural network model.

In some implementations, each embedding in the set of embeddings is associated with: (i) a cluster label that identifies a cluster that includes the embedding, and (ii) a set of confounding features; wherein determining the clustering loss based on the clustering of the set of embeddings into the plurality of clusters of embeddings comprises: designating a proper subset of the set of embeddings as being training embeddings; training a classification machine learning model, comprising, for each training embedding, training the classification machine learning model to process the set of confounding features corresponding to the training embedding to predict the cluster label of the training embedding; and after training the classification machine learning model, determining the clustering loss using the classification machine learning model.

In some implementations, determining the clustering loss using the classification machine learning model comprises: designating a proper subset of the set of embeddings as validation embeddings; evaluating a classification accuracy of the classification machine learning model on a task of processing the set of confounding features corresponding to each validation embedding to predict the cluster label of the validation embedding; and determining the clustering loss based on the classification accuracy of the classification machine learning model.

In some implementations, the set of confounding features are designated as being substantially irrelevant to a medical condition.

In some implementations, the set of confounding features are designated as being substantially irrelevant to a treatment for a medical condition.

In some implementations, for each embedding, the set of confounding features are not included in multi-modal data processed by the encoder neural network to generate the embedding.

In some implementations, for each embedding, the corresponding set of confounding features comprise: features of a sensor that captured sensor data included in the multi-modal data processed by the encoder neural network to generate the embedding, or features of an acquisition protocol used to acquire a portion of the multi-modal data processed by the encoder neural network to generate the embedding, or both.

In some implementations, updating the current values of the set of encoder parameters using gradients of the objective function that depends on the clustering loss encourages a decrease in the classification accuracy of the classification machine learning model.

In some implementations, updating the current values of the set of encoder parameters using gradients of the objective function that depends on the clustering loss encourages confounding features corresponding to embeddings with different cluster labels to become more entangled in a confounding feature space.

In some implementations, clustering the set of embeddings into the plurality of clusters of embeddings comprises applying a k-means clustering operation to the set of embeddings.
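For illustration, a k-means clustering operation of the kind referred to above can be sketched in Python using Lloyd's algorithm. Passing the initial centroids explicitly, running a fixed number of iterations, and using Euclidean distance are assumptions made to keep the sketch deterministic; the example embeddings are hypothetical.

```python
import math


def k_means(embeddings, initial_centroids, num_iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(embeddings)
    for _ in range(num_iterations):
        # Assignment step: each embedding takes the label of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(e, centroids[k]))
            for e in embeddings
        ]
        # Update step: each centroid moves to the mean of its assigned embeddings.
        for k in range(len(centroids)):
            members = [e for e, label in zip(embeddings, labels) if label == k]
            if members:  # leave an empty cluster's centroid unchanged
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical two-dimensional embeddings forming two well-separated groups.
embeddings = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
cluster_labels, centroids = k_means(embeddings, initial_centroids=embeddings[:2])
```

The returned cluster labels are the labels that a classification machine learning model could then be trained to predict when determining a clustering loss.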

In some implementations, the method further comprises: outputting the encoder neural network and the decoder neural network after the joint training of the encoder neural network and the decoder neural network.

In some implementations, for each training example, the multi-modal data included in the training example comprises a respective feature representation for each of a plurality of modalities.

In some implementations, for each training example, the plurality of modalities include a functional magnetic resonance imaging (fMRI) modality, and the feature representation for the fMRI modality is derived from a series of fMRI images that each correspond to a respective time point in a sequence of time points and characterize blood flow in a brain of the corresponding subject at the time point.

According to another aspect, there is provided a method performed by one or more computers, the method comprising: receiving multi-modal data characterizing a patient, wherein the multi-modal data comprises a respective feature representation for each of a plurality of modalities; processing the multi-modal data characterizing the patient using a machine learning model, in accordance with values of a set of machine learning model parameters, to generate a patient classification that classifies the patient as being included in a patient category from a set of patient categories; determining an uncertainty measure that characterizes an uncertainty of the patient classification generated by the machine learning model; and generating a clinical recommendation for medical treatment of the patient based on: (i) the patient classification, and (ii) the uncertainty measure that characterizes the uncertainty of the patient classification.

In some implementations, generating the patient classification that classifies the patient as being included in a patient category from the set of patient categories comprises: generating, by the machine learning model, a respective classification score for each patient category in the set of patient categories; and classifying the patient as being included in the patient category based on the classification scores.

In some implementations, classifying the patient as being included in the patient category based on the classification scores comprises: classifying the patient as being included in a patient category having a highest classification score.

In some implementations, determining the uncertainty measure that characterizes the uncertainty of the patient classification generated by the machine learning model comprises: processing the classification scores for the patient categories to identify a trust set for the patient, wherein: the trust set specifies a plurality of patient categories that form a proper subset of the set of patient categories; and the patient is predicted to be included in a patient category within the trust set with at least a threshold probability; and determining the uncertainty measure based on the trust set for the patient.

In some implementations, determining the uncertainty measure based on the trust set for the patient comprises: determining the uncertainty measure based on a number of patient categories included in the trust set for the patient.

In some implementations, processing the classification scores for the patient categories to identify the trust set for the patient comprises: determining an ordering of the patient categories in the set of patient categories based on the classification scores for the patient categories; identifying that, for a particular patient category: (i) a sum of the classification scores for patient categories up to and including the particular patient category, in the ordering of the patient categories, exceeds a predefined threshold, and (ii) a sum of the classification scores for patient categories strictly preceding the particular patient category does not exceed the predefined threshold; and determining that each patient category up to and including the particular patient category, in the ordering of the patient categories, is included in the trust set.
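The trust set construction above can be sketched in Python as follows. Ordering the patient categories from highest to lowest classification score is an assumption, as are the function name `trust_set`, the category names, and the example scores and threshold.

```python
def trust_set(classification_scores, threshold):
    """Smallest prefix of the score-ordered categories whose summed score exceeds `threshold`.

    `classification_scores` maps each patient category to its classification score.
    """
    # Assumed ordering: decreasing classification score.
    ordering = sorted(classification_scores, key=classification_scores.get, reverse=True)
    cumulative = 0.0
    selected = []
    for category in ordering:
        selected.append(category)
        cumulative += classification_scores[category]
        if cumulative > threshold:
            # Sum up to and including this category exceeds the threshold,
            # while the sum over strictly preceding categories did not.
            break
    return selected


# Hypothetical classification scores for four patient categories.
scores = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.05}
categories = trust_set(scores, threshold=0.75)

# One possible uncertainty measure: the number of categories in the trust set.
uncertainty = len(categories)
```

A confident classification concentrates its score in few categories and so yields a small trust set, while a diffuse score distribution yields a larger one.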

In some implementations, the predefined threshold is determined by operations comprising: obtaining a set of calibration examples, wherein each calibration example corresponds to a respective calibration patient and comprises: (i) multi-modal data characterizing the calibration patient, and (ii) a target patient category of the calibration patient; determining a respective calibration score for each calibration patient; and determining the predefined threshold as a quantile of the calibration scores.

In some implementations, determining the predefined threshold as a quantile of the calibration scores comprises: determining the predefined threshold as an a-th quantile of the calibration scores, wherein a is based on: (i) a number of calibration examples in the set of calibration examples, and (ii) the threshold probability for the trust set.
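One way to choose a quantile level that depends on both the number of calibration examples and the threshold probability is the finite-sample correction commonly used in split conformal prediction, sketched below in Python. This particular formula, with rank ceil((n + 1) p) among n calibration scores, is an assumption borrowed from that literature and is not stated in this description; the function name and calibration scores are hypothetical.

```python
import math


def calibrated_threshold(calibration_scores, threshold_probability):
    """Quantile of the calibration scores with a finite-sample correction.

    Assumption: the quantile is the score of rank ceil((n + 1) * p), capped at n,
    where n is the number of calibration examples and p the threshold probability.
    """
    if not 0.0 < threshold_probability < 1.0:
        raise ValueError("threshold probability must lie strictly between 0 and 1")
    n = len(calibration_scores)
    rank = min(math.ceil((n + 1) * threshold_probability), n)
    return sorted(calibration_scores)[rank - 1]


# Hypothetical calibration scores for ten calibration patients.
calibration_scores = [0.3, 0.9, 0.1, 0.5, 0.7, 0.2, 1.0, 0.4, 0.8, 0.6]
threshold = calibrated_threshold(calibration_scores, threshold_probability=0.8)
```

With ten calibration scores and a threshold probability of 0.8, the rank is 9, so the selected quantile sits slightly above the uncorrected 80th percentile; the correction shrinks as the calibration set grows.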

In some implementations, determining the respective calibration score for each calibration patient comprises, for each calibration patient: processing the multi-modal data characterizing the calibration patient using the machine learning model to generate a respective classification score for each patient category in the set of patient categories; and determining the calibration score for the calibration patient based on an error between: (i) the classification scores for the patient categories, and (ii) the target patient category of the calibration patient.

In some implementations, the machine learning model comprises an encoder neural network, and wherein generating a respective classification score for each patient category in the set of patient categories comprises: processing the multi-modal data characterizing the patient using the encoder neural network to generate an embedding of the multi-modal data characterizing the patient; and determining the respective classification score for each patient category in the set of patient categories based on the embedding of the multi-modal data characterizing the patient.

In some implementations, generating a clinical recommendation for medical treatment of the patient comprises: evaluating a confidence criterion based at least in part on the uncertainty measure that characterizes the uncertainty of the patient classification; and in response to determining that the confidence criterion is satisfied, generating the clinical recommendation for the patient based on the patient classification.

In some implementations, evaluating the confidence criterion comprises: determining that the uncertainty measure that characterizes the uncertainty of the patient classification satisfies an uncertainty threshold.

In some implementations, evaluating the confidence criterion further comprises: determining that the patient category includes at least a threshold number of patients.

In some implementations, generating the clinical recommendation for the patient based on the patient classification comprises: determining a fraction of patients included in the patient category that have been designated as having responded to a medical treatment; and determining that the patient should receive the medical treatment based on the fraction of patients included in the patient category that have been designated as having responded to the medical treatment, wherein the clinical recommendation indicates that the patient should receive the medical treatment.
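A minimal Python sketch of combining the confidence criterion with the responder fraction is given below. The specific thresholds, the convention that a lower uncertainty measure is better, the function name `recommend_treatment`, and the use of `None` to signal that no recommendation is generated are all illustrative assumptions.

```python
def recommend_treatment(responded, uncertainty, max_uncertainty=3,
                        min_patients=5, min_response_fraction=0.5):
    """Return True or False for the recommendation, or None if confidence is insufficient.

    `responded` holds one flag per patient in the patient category, marking whether
    that patient was designated as having responded to the treatment. All threshold
    values are hypothetical.
    """
    # Confidence criterion: uncertainty threshold and minimum category size.
    if uncertainty > max_uncertainty or len(responded) < min_patients:
        return None
    # Fraction of patients in the category designated as responders.
    fraction = sum(responded) / len(responded)
    return fraction >= min_response_fraction


# Hypothetical response designations for the patients in one patient category.
category_responses = [True, True, False, True, True, False]

recommendation = recommend_treatment(category_responses, uncertainty=2)
no_recommendation = recommend_treatment(category_responses, uncertainty=5)
```

Returning no recommendation when the confidence criterion fails keeps an uncertain classification from driving a treatment decision.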

In some implementations, the plurality of modalities include a functional magnetic resonance imaging (fMRI) modality, and the feature representation for the fMRI modality is derived from a series of fMRI images that each correspond to a respective time point in a sequence of time points and characterize blood flow in a brain of the patient at the time point.

In some implementations, the plurality of modalities include a genomics modality, and the feature representation of the genomics modality is derived from data defining a sequence of nucleotides from a genome of the patient.

In some implementations, the plurality of modalities include an audio modality, and the feature representation of the audio modality is derived from audio data that represents a sequence of words spoken by the patient.

In some implementations, the plurality of modalities include an electroencephalography (EEG) modality, and the feature representation of the EEG modality is derived from a plurality of voltage waveforms that are each measured by a respective electrode placed in proximity to a brain of the patient.

According to another aspect, there is provided a method performed by one or more computers, the method comprising: obtaining, for each patient in a population of patients, multi-modal data that characterizes the patient; processing, for each patient in the population of patients, the multi-modal data characterizing the patient using an encoder neural network to generate an embedding of the multi-modal data in a latent space, wherein the embeddings of multi-modal data characterizing the patients in the population of patients collectively define a set of embeddings in the latent space; processing the set of embeddings in the latent space to generate a set of parameters defining a region of the latent space that encloses the set of embeddings in the latent space; processing: (i) the set of parameters defining the region of the latent space, and (ii) the set of embeddings, to identify a proper subset of the embeddings in the set of embeddings as being archetype embeddings; and identifying the respective multi-modal data represented by each archetype embedding as a respective multi-modal data archetype.

In some implementations, the region of the latent space that encloses the set of embeddings in the latent space is a convex set.

In some implementations, the region of the latent space that encloses the set of embeddings in the latent space is a convex hull of the set of embeddings in the latent space.

In some implementations, processing: (i) the set of parameters defining the region of the latent space, and (ii) the set of embeddings, to identify a proper subset of the embeddings in the set of embeddings as being archetype embeddings comprises: determining a set of vertices of the region enclosing the set of embeddings in the latent space; and identifying the archetype embeddings using the set of vertices of the region enclosing the set of embeddings in the latent space.

In some implementations, identifying the archetype embeddings using the set of vertices of the region enclosing the set of embeddings in the latent space comprises, for each vertex: identifying an embedding in the set of embeddings that has a minimum distance to the vertex from among the embeddings in the set of embeddings as being an archetype embedding corresponding to the vertex.

In some implementations, the method further comprises, for each multi-modal data archetype: generating a respective intensity score for each of a plurality of feature dimensions of the multi-modal data archetype based on: (i) a value of the feature dimension of the multi-modal data archetype, and (ii) a distribution defined by values of the feature dimension of multi-modal data across the population of patients; and generating a representation of the multi-modal data archetype that includes the respective intensity score for each of the plurality of feature dimensions of the multi-modal data archetype.

In some implementations, for each of the plurality of feature dimensions of the multi-modal data archetype, the intensity score for the feature dimension characterizes a likelihood of the value of the feature dimension of the multi-modal data archetype under the distribution defined by values of the feature dimension of multi-modal data across the population of patients.

In some implementations, for each of the plurality of feature dimensions of the multi-modal data archetype, determining the intensity score for the feature dimension comprises: determining a mean and a standard deviation of the distribution defined by values of the feature dimension of multi-modal data across the population of patients; and determining the intensity score for the feature dimension using the mean and the standard deviation of the distribution defined by values of the feature dimension of multi-modal data across the population of patients.

In some implementations, the method further comprises providing the representations of the multi-modal data archetypes as explainability data that explains patterns in multi-modal data across the population of patients.

In some implementations, the method further comprises: clustering the set of embeddings to generate a set of clusters of embeddings, wherein each cluster is represented by a respective archetype embedding; and identifying each cluster of embeddings as representing a respective patient category.

In some implementations, clustering the set of embeddings to generate the set of clusters of embeddings comprises, for one or more embeddings in the set of embeddings: determining, for each archetype embedding, a respective distance between the embedding and the archetype embedding; and assigning the embedding to a cluster represented by an archetype embedding having minimum distance to the embedding.

In some implementations, the encoder neural network has been jointly trained with a decoder neural network on a set of training examples, wherein each training example corresponds to a respective patient and includes multi-modal data that characterizes the patient, wherein: the encoder neural network is configured to process input multi-modal data characterizing an input patient to generate an embedding of the input multi-modal data in the latent space; and the decoder neural network is configured to process the embedding of the input multi-modal data to generate a reconstruction of the input multi-modal data.
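One possible way to derive an intensity score from the mean and the standard deviation is a standard score (z-score), sketched below. The z-score form, the function name `intensity_scores`, and the handling of zero-variance feature dimensions are assumptions for illustration; the text above only requires that the score use the mean and the standard deviation.

```python
import numpy as np

def intensity_scores(archetype, population):
    """Z-score of each archetype feature relative to the patient population.

    archetype:  shape (F,)   feature values of one multi-modal data archetype
    population: shape (P, F) feature values across the population of patients
    """
    archetype = np.asarray(archetype, dtype=float)
    population = np.asarray(population, dtype=float)
    mean = population.mean(axis=0)
    std = population.std(axis=0)
    # Assumption: a constant feature (std == 0) is scored against a unit scale
    # to avoid division by zero.
    safe_std = np.where(std > 0, std, 1.0)
    return (archetype - mean) / safe_std
```

Under this sketch, a large positive or negative score marks a feature dimension in which the archetype is unusual relative to the population.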
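The minimum-distance assignment described above can be sketched as follows, again assuming Euclidean distance and NumPy; the function name `assign_clusters` is hypothetical.

```python
import numpy as np

def assign_clusters(embeddings, archetype_embeddings):
    """Assign each embedding to the cluster of its nearest archetype embedding."""
    embeddings = np.asarray(embeddings, dtype=float)                     # (N, D)
    archetype_embeddings = np.asarray(archetype_embeddings, dtype=float)  # (K, D)
    # Distance from every embedding to every archetype embedding: shape (N, K).
    dists = np.linalg.norm(
        embeddings[:, None, :] - archetype_embeddings[None, :, :], axis=-1
    )
    # Cluster label = index of the archetype embedding at minimum distance.
    return dists.argmin(axis=1)
```

Each distinct label returned then corresponds to one patient category.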

In some implementations, jointly training the encoder neural network andthe decoder neural network comprises, for each training example:processing the multi-modal data from the training example using theencoder neural network, in accordance with current values of the set ofencoder neural network parameters, to generate an embedding of themulti-modal data from the training example; processing the embedding ofthe multi-modal data from the training example using the decoder neuralnetwork, in accordance with current values of the set of decoder neuralnetwork parameters, to generate a reconstruction of the multi-modal datafrom the training example; and updating the current values of the set ofencoder neural network parameters and the current values of the set ofdecoder neural network parameters using gradients of a reconstructionloss function that measures an error in the reconstruction of themulti-modal data from the training example, wherein: the reconstructionloss function comprises a plurality of scaling factors that each scale arespective term in the reconstruction loss function that measures anerror in the reconstruction of a corresponding proper subset of a set offeature dimensions of the multi-modal data from the training example,and each of the plurality of scaling factors has a respective value thatis based on a relevance of the corresponding proper subset of the set offeature dimensions of the multi-modal data from the training example toa particular medical condition.

In some implementations, the respective value of each of the plurality of scaling factors is based on a relevance of the corresponding proper subset of the set of feature dimensions of the multi-modal data from the training example to diagnosing the particular medical condition.

In some implementations, the respective value of each of the plurality of scaling factors is based on a relevance of the corresponding proper subset of the set of feature dimensions of the multi-modal data from the training example to a treatment for the particular medical condition.

In some implementations, scaling factors corresponding to proper subsets of the set of feature dimensions of the multi-modal data that are more relevant to the particular medical condition have higher values than scaling factors corresponding to proper subsets of the feature dimensions of the multi-modal data that are less relevant to the particular medical condition.

In some implementations, for each patient, the multi-modal data characterizing the patient comprises a respective feature representation for each of a plurality of modalities.

In some implementations, the plurality of modalities include a functional magnetic resonance imaging (fMRI) modality, and the feature representation for the fMRI modality is derived from a series of fMRI images that each correspond to a respective time point in a sequence of time points and characterize blood flow in a brain of the patient at the time point.
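A reconstruction loss with per-subset scaling factors might look like the following sketch. The mean squared error per subset, the function name `weighted_reconstruction_loss`, and the example index groups are assumptions; the text above specifies only that each term is scaled according to the relevance of its feature dimensions.

```python
import numpy as np

def weighted_reconstruction_loss(x, x_hat, groups, scales):
    """Sum of per-subset reconstruction errors, each scaled by a relevance factor.

    x, x_hat: original and reconstructed feature vectors, shape (F,)
    groups:   list of index lists, each a proper subset of feature dimensions
    scales:   one scaling factor per group (higher = more clinically relevant)
    """
    x = np.asarray(x, dtype=float)
    x_hat = np.asarray(x_hat, dtype=float)
    total = 0.0
    for idx, scale in zip(groups, scales):
        # Assumption: squared error averaged over the subset's dimensions.
        total += scale * np.mean((x[idx] - x_hat[idx]) ** 2)
    return total
```

With a higher scaling factor on the more relevant subset, an equal reconstruction error in that subset contributes more to the loss, and therefore to the gradients, than the same error elsewhere.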

According to another aspect, there is provided a method performed by oneor more computers, the method comprising: generating a drug signaturefor a drug, wherein: the drug signature comprises a respective impactscore for each of a plurality of features; and the impact score for afeature characterizes an impact, caused by administering a drug to oneor more entities, on a value of the feature measured for the one or moreentities; generating an embedding of the drug signature in a latentspace, comprising: generating a network input to an encoder neuralnetwork based on the drug signature; and processing the network inputgenerated based on the drug signature using the encoder neural networkto generate the embedding of the drug signature in the latent space; andprocessing: (i) the embedding of the drug signature in the latent space,and (ii) data defining a plurality of patient categories, to generate aplurality of response scores, wherein each response score corresponds toa respective patient category and characterizes a predicted response ofpatients included in the patient category to the drug.

In some implementations, generating the drug signature comprises: obtaining, for each of the entities: (i) a pre-treatment feature representation of the entity that comprises, for each of the plurality of features, a respective pre-treatment value of the feature that is measured for the entity prior to the drug being administered to the entity; and (ii) a post-treatment feature representation of the entity that comprises, for each of the plurality of features, a respective post-treatment value of the feature that is measured for the entity after the drug is administered to the entity; and generating the drug signature based on the pre-treatment and post-treatment feature representations of the entities.

In some implementations, generating the drug signature based on the pre-treatment and post-treatment feature representations of the entities comprises: generating, for each of the plurality of entities, a respective differential feature representation of the entity based on a difference between: (i) the pre-treatment feature representation of the entity, and (ii) the post-treatment feature representation of the entity; and generating the drug signature based on the differential feature representations of the entities.

In some implementations, generating the drug signature based on the differential feature representations of the entities comprises: generating a respective entity-specific drug signature for each of the entities based on the differential feature representation of the entity; and generating the drug signature by combining the entity-specific drug signatures.

In some implementations, for each of the entities, generating the entity-specific drug signature for the entity comprises: element-wise dividing the differential feature representation for the entity by the pre-treatment feature representation of the entity.

In some implementations, generating the drug signature by combining the entity-specific drug signatures comprises: averaging the entity-specific drug signatures.

In some implementations, the drug signature comprises one or more impact scores that each characterize an impact, caused by administering the drug to the one or more entities, on a level of expression of a respective gene in the one or more entities.

In some implementations, the drug signature comprises one or more impact scores that each characterize an impact, caused by administering the drug to the one or more entities, on a level of expression of a respective protein in the one or more entities.

In some implementations, the network input to the encoder neural network includes the drug signature.

In some implementations, each of the plurality of patient categories is defined by a cluster of patient embeddings in the latent space, wherein each patient embedding corresponds to a respective patient and is generated by processing multi-modal data characterizing the patient using the encoder neural network.

In some implementations, for each of the plurality of patient categories, generating the response score for the patient category comprises: determining a respective similarity measure between: (i) the embedding of the drug signature, and (ii) each of one or more patient embeddings in the cluster of patient embeddings defining the patient category; and determining the response score for the patient category based on the similarity measures.

In some implementations, the method further comprises determining a ranking of the plurality of patient categories based on the response scores.

In some implementations, the method further comprises: determining that a new patient is included in a patient category of the plurality of patient categories; identifying the response score for the patient category of the new patient; and automatically generating a recommendation for whether the new patient should receive the drug based at least in part on the response score for the patient category of the new patient.

In some implementations, each of the one or more entities comprises a cell.

In some implementations, each of the one or more entities comprises a collection of cells.

In some implementations, each of the one or more entities is a patient.

In some implementations, the encoder neural network has been trained to process multi-modal data characterizing patients.
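The drug signature computation described in the preceding paragraphs (differential representation, element-wise division by the pre-treatment representation, then averaging over entities) can be sketched as below. The sign convention (post-treatment minus pre-treatment), the function name `drug_signature`, and the requirement that pre-treatment values be non-zero are assumptions made for illustration.

```python
import numpy as np

def drug_signature(pre, post):
    """Average relative change of each feature across entities.

    pre, post: shape (E, F) pre- and post-treatment feature values for
               E entities and F features. Pre-treatment values are assumed
               to be non-zero so the element-wise division is defined.
    """
    pre = np.asarray(pre, dtype=float)
    post = np.asarray(post, dtype=float)
    differential = post - pre              # differential feature representation
    entity_specific = differential / pre   # entity-specific drug signatures
    return entity_specific.mean(axis=0)    # combine by averaging over entities
```

Each element of the result is then an impact score for one feature, e.g., the average relative change in the expression level of one gene.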
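As a hedged sketch of the similarity-based scoring described above, the response score for each patient category could be taken as the mean cosine similarity between the drug signature embedding and the patient embeddings in the category's cluster. Cosine similarity, the use of a mean, and the names `cosine_similarity` and `response_scores` are all assumptions; the text above requires only a similarity measure and a score based on it.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def response_scores(drug_embedding, clusters):
    """Map each patient category to a score based on similarity measures.

    drug_embedding: shape (D,) embedding of the drug signature
    clusters:       dict of category name -> (n, D) patient embeddings
    """
    drug_embedding = np.asarray(drug_embedding, dtype=float)
    scores = {}
    for name, points in clusters.items():
        points = np.asarray(points, dtype=float)
        sims = [cosine_similarity(drug_embedding, p) for p in points]
        # Assumption: aggregate the per-patient similarities by their mean.
        scores[name] = float(np.mean(sims))
    return scores
```

Sorting the categories by the resulting scores would yield a ranking of the patient categories.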

In some implementations, the encoder neural network has been trained byoperations comprising: obtaining a plurality of training examples,wherein each training example corresponds to a respective patient andincludes multi-modal data that characterizes the patient; jointlytraining the encoder neural network along with a decoder neural networkon the plurality of training examples, comprising, for each trainingexample: processing the multi-modal data from the training example usingthe encoder neural network to generate an embedding of the multi-modaldata from the training example; processing the embedding of themulti-modal data from the training example using the decoder neuralnetwork to generate a reconstruction of the multi-modal data from thetraining example; and updating current values of a set of encoderparameters and current values of a set of decoder parameters usinggradients of a reconstruction loss function that measures an error inthe reconstruction of the multi-modal data from the training example.

Particular embodiments of the subject matter described in this specification can be implemented so as to realize one or more of the following advantages.

The machine learning system described in this specification can processmulti-modal data characterizing a patient to generate a multi-modal dataembedding that represents the multi-modal data in a lower-dimensionallatent space. In particular, the machine learning system can processrespective multi-modal data characterizing each patient in a populationof patients to generate a set of multi-modal data embeddings distributedacross the latent space. The machine learning system can then apply aclustering operation to partition the set of multi-modal data embeddingsinto a set of clusters. The clustering of the multi-modal dataembeddings in the latent space defines a partition of the population ofpatients into a set of patient categories, i.e., where each patientcategory corresponds to a respective cluster of multi-modal dataembeddings in the latent space.

Each patient category can be understood to represent a “type” ofpatient, e.g., such that patients included in the same patient categoryare more likely to share similar characteristics. Conventionalapproaches for dividing populations of patients into patient categoriescan rely on criteria that are manually specified by human experts, e.g.,traditional medical taxonomic criteria, which are often basic in natureand rely on data from few modalities, in many cases, only one modality.In contrast, the machine learning system provides a data-driven approachfor automatically identifying patient categories based on complexpatterns and correlations in multi-modal data well beyond what could beanalyzed by a human or solely in the human mind.

Patient categories identified by the machine learning system can be used as a basis for making inferences (predictions) about patients and for making clinical decisions related to patient care. For example, the patient categories identified by the machine learning system can be used to identify types of patients that are more likely to respond well to certain medical treatments, as will be described in more detail below.

The machine learning system generates embeddings of multi-modal datausing a deep neural network, referred to as an encoder neural network.The encoder neural network is configured to process multi-modal data inaccordance with values of a set of encoder neural network parameters toimplement a non-linear dimensionality reducing transformation that mapsthe multi-modal data to a corresponding embedding in the latent space.Generating multi-modal data embeddings using a deep neural network (asopposed to, e.g., a linear transformation) can increase the likelihoodthat the embeddings are readily separable into clusters, and that theclustering is “stable.” Clustering can be said to be stable, e.g., ifsimilar clusters are obtained by applying the clustering process todifferent patient populations.

Generally, multi-modal data characterizing patients is interpretable,e.g., because the value of each feature dimension of the multi-modaldata measures a real-world attribute, e.g., blood flow in a region of abrain. In contrast, the latent space (i.e., in which the machinelearning system clusters multi-modal data embeddings to identify patientcategories), is not directly interpretable. Lack of interpretability canlimit the applicability of machine learning systems and their outputs,particularly in settings where acting on the outputs generated by themachine learning system requires user trust and confidence in theirvalidity.

To address this issue, the machine learning system can generate a set ofmulti-modal data “archetypes,” e.g., that can provide a way ofinterpreting the dimensions of the latent space. More specifically, eachmulti-modal data archetype can be a collection of multi-modal data thatexplains a respective dimension of the latent embedding space, i.e., byproviding a representation of the dimension of the latent space in thespace of multi-modal data. By providing a way of interpreting thedimensions of the latent space, the multi-modal data archetypes canfacilitate more efficient use of computational resources. For example,as will be described in more detail below, a user can evaluate themulti-modal data archetypes to determine that one or more dimensions ofthe latent space represent multi-modal data that is substantiallyirrelevant, e.g., to a medical condition of interest. In response, themachine learning system can remove the specified dimensions of thelatent space, thus reducing the dimensionality of the latent space, andas a result, reducing consumption of computational resources (e.g.,memory and computing power) during clustering of the multi-modal dataembeddings in the latent space.

In some cases, to generate multi-modal data archetypes, the machinelearning system can process multi-modal data characterizing each patientin a population of patients using an encoder neural network to generatea set of embeddings in a latent space. The machine learning system canprocess the set of embeddings to generate a set of region parametersdefining a region of the latent space that encloses the set ofembeddings, e.g., the convex hull of the set of embeddings. The machinelearning system can then generate multi-modal data archetypes based onthe region of the latent space, e.g., by generating a respectivemulti-modal data archetype corresponding to each vertex of the region.In particular, for each vertex of the region, the machine learningsystem can identify a respective embedding having minimum distance tothe vertex from among the set of embeddings, and identify themulti-modal data represented by the minimum distance embedding as beinga multi-modal data archetype.

The machine learning system can thus leverage the geometry of thedistribution of the set of embeddings in the latent space to identify aset of multi-modal data archetypes that represent patterns andcorrelations in multi-modal data across the population of patients. Thenumber of multi-modal data archetypes can be significantly less than thenumber of patients in the population of patients (e.g., by one or moreorders of magnitude), and the multi-modal data archetypes thus providean efficient and compact representation of multi-modal datacharacterizing the population of patients. Each multi-modal dataarchetype can define actual multi-modal data characterizing a real-worldpatient, as opposed to, e.g., multi-modal data synthesized by themachine learning system, and is thus more reliable as a result of beingdirectly anchored to real-world multi-modal patient data.

The machine learning system jointly trains the encoder neural network(that processes multi-modal data to generate embeddings) along with adecoder neural network (that processes embeddings to generatemulti-modal data). The machine learning system then uses the trainedneural networks to identify patient categories. The machine learningsystem trains the encoder and decoder neural networks to optimize anobjective function that increases the clinical relevance, e.g., to aparticular medical condition, of the patient categories identified usingthe encoder and decoder neural networks.

For example, the objective function can include a reconstruction loss function. To train the encoder and decoder neural networks using the reconstruction loss function, the machine learning system processes multi-modal data using the encoder neural network to generate an embedding, and then processes the embedding using the decoder neural network to generate a reconstruction (i.e., an estimate) of the original multi-modal data.

The reconstruction loss function penalizes errors in the reconstructedmulti-modal data, in particular, by penalizing a respective error in thereconstruction of each feature dimension of the multi-modal data basedon the relevance of the feature dimension to a medical condition. Morespecifically, errors in the reconstruction of feature dimensions thatare more relevant to the medical condition incur a higher penalty, underthe reconstruction loss function, than errors in the reconstruction offeature dimensions that are less relevant to the medical condition. Thereconstruction loss function thereby encourages embeddings topreferentially preserve information content from multi-modal data thatis most relevant to the medical condition, and thus increases therelevance of the patient categories (i.e., which are determined byclustering the embeddings) to the medical condition.

As another example, the objective function can include an archetype loss function that is defined with reference to one or more “target” multi-modal data archetypes which can be specified, e.g., by a user of the machine learning system. Each target multi-modal data archetype is associated with a corresponding dimension of the latent space and represents a target (i.e., desired) output to be generated by the decoder neural network by processing an embedding representing the dimension of the latent space. For each of one or more dimensions of the latent space that are associated with a respective target multi-modal data archetype, the archetype loss function encourages the decoder neural network to map an embedding representing the dimension of the latent space onto the target multi-modal data archetype.

Generally, a user can select target multi-modal data archetypes using any appropriate criteria, and selecting target multi-modal data archetypes can enable a user to control how the encoder neural network represents multi-modal data in the latent space. This provides a significant advantage over training paradigms that treat the latent space as a “black box” outside the control of the user.

Moreover, users can select target multi-modal data archetypes thatrepresent clinically meaningful patterns in multi-modal datacharacterizing patients. In particular, users can select targetmulti-modal data archetypes that are relevant to a medical condition,e.g., that include multi-modal features that typically co-occur inpatients having the medical condition. Thus the archetype loss functioncan encourage embeddings generated by the encoder neural network torepresent information relevant to the medical condition, and therebyincrease the relevance of the patient categories (i.e., which aredetermined by clustering the embeddings) to the medical condition.

As another example, the objective function can include a clustering loss function based on a clustering, in the latent space, of embeddings generated by the encoder neural network. The clustering loss function can encourage embeddings generated by the encoder neural network to separate into clusters in the latent space, and can reduce any dependence of the clusters on “confounding” features. Confounding features can refer to features that are designated (e.g., by a user) as being substantially irrelevant, e.g., to a medical condition or to a treatment for a medical condition.

The reconstruction loss function, the archetype loss function, and the clustering loss function can enable reduced consumption of computational resources, e.g., memory and computing power, during training of the encoder and decoder neural networks. For example, the reconstruction, archetype, and clustering loss functions can enable the machine learning system to achieve an acceptable performance in identifying patient categories using encoder and decoder neural networks that have been trained over fewer training iterations, using less training data, or both.

The machine learning system can condition multi-modal data characterizing a “target” subject based on conditioning data, derived from a population of “reference” subjects, to generate conditioned multi-modal data, e.g., that the machine learning system can subsequently process to classify the target subject into a patient category. (“Conditioning” multi-modal data based on conditioning data can refer to updating the multi-modal data by combining the conditioning data with the multi-modal data.) Conditioning the multi-modal data has the effect of enriching the information content of the multi-modal data characterizing the target subject based on auxiliary data derived from a population of reference subjects. For example, the reference subjects may be subjects who have received a medical treatment, in particular, a drug, and the machine learning system can condition the multi-modal data characterizing the target subject on average penetration of the drug into respective brain regions across the population of reference subjects. As another example, the machine learning system can condition the multi-modal data characterizing the target subject on conditioning data defining statistical correlations, measured across the population of reference subjects, e.g., between gene expression levels or protein expression levels in the reference subjects.

Conditioning multi-modal data on conditioning data derived from a population of reference subjects can enable the encoder neural network to generate richer multi-modal data embeddings that facilitate more effective (e.g., more clinically relevant) patient classification. Moreover, conditioning multi-modal data can enable reduced consumption of computational resources during training of the encoder and decoder neural networks, e.g., by causing the machine learning system to achieve an acceptable performance, e.g., in identifying patient categories and classifying patients, over fewer training iterations, using less training data, or both.

As part of classifying a patient as being included in a patientcategory, the machine learning system can determine an uncertainty ofthe patient classification. Uncertainty in patient classification canarise from, e.g., errors and noise in the multi-modal datacharacterizing the patient, as well as ambiguity inherent in mappingcomplex, high-dimensional multi-modal data to a discrete set of patientcategories. The machine learning system can incorporate the uncertaintyof the patient classification into an automated process for generatingclinical recommendations (e.g., for patient treatment), e.g., byrefraining from generating a clinical recommendation in cases where thepatient classification is uncertain. Acting on clinical recommendations,e.g., to administer treatments to patients, requires user trust andconfidence in the validity of the clinical recommendations. Measuringuncertainty as part of an automated process for generating clinicalrecommendations can increase the clinical applicability of the machinelearning system.

The machine learning system can generate a “trust set” for a patient that specifies a proper subset of the full set of patient categories, where the patient is predicted to be included in a patient category within the trust set with at least a threshold probability (e.g., 95%). In contrast to a point estimate for a patient classification, i.e., that defines a single “best guess” for the patient category of the patient, the trust set can explain the uncertainty in the patient classification. The trust set can thus increase the interpretability of patient classifications generated by the machine learning system, e.g., by explaining the uncertainty in patient classifications, which can further enhance the clinical applicability of the machine learning system.
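One simple way to form such a trust set, assuming the system has already produced a probability for each patient category, is to add categories in order of decreasing probability until the threshold is reached. This greedy construction, the function name `trust_set`, and the input format are assumptions for illustration; this passage does not specify how the probabilities or the set are computed.

```python
import numpy as np

def trust_set(probs, threshold=0.95):
    """Smallest set of category indices whose total probability >= threshold.

    probs: per-category probabilities for one patient (assumed to sum to 1).
    """
    probs = np.asarray(probs, dtype=float)
    order = np.argsort(probs)[::-1]          # categories, most probable first
    cumulative = np.cumsum(probs[order])
    # First position at which the running total reaches the threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    k = min(k, len(order))                   # guard against rounding shortfalls
    return order[:k].tolist()
```

A trust set containing a single category indicates a confident classification, while a larger set signals that the classification is uncertain.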

The machine learning system can generate a response score for each patient category in the set of patient categories, where the response score for a patient category characterizes a predicted response of patients included in the patient category to receiving a drug. To generate the response scores, the machine learning system can generate a drug signature that characterizes a respective impact of the drug on the value of each of multiple features characterizing an entity that receives the drug, e.g., genomic features, proteomic features, etc. The machine learning system can generate an embedding of the drug signature using an encoder neural network which has been trained to process multi-modal data characterizing patients. The machine learning system can then generate the response scores for the patient categories, e.g., by comparing the drug signature embedding to clusters of multi-modal data embeddings representing the patient categories.

The machine learning system can thus generate response scores by leveraging an encoder neural network which has been trained to process one type of data (in particular, multi-modal data characterizing patients) to process a different type of data (in particular, drug signatures) without requiring the use of additional training data or training iterations. Moreover, the machine learning system can generate the response scores in an unsupervised fashion, in particular, without requiring “labeled” training data, e.g., that associates patients or patient categories with real world patient response data. The machine learning system thereby enables more efficient use of computational resources, e.g., as compared to a system that generates response scores by training a specialized machine learning model from scratch on labeled training data.

The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows an example machine learning system.

FIG. 2 shows an example architecture of an encoder neural network and a decoder neural network.

FIGS. 3A-3B show example archetype generation systems.

FIGS. 4A-C illustrate archetypes and example processes for generating archetypes.

FIG. 5 shows an example patient clustering system.

FIG. 6 shows an example cluster analysis system.

FIG. 7A shows an example patient classification system.

FIG. 7B shows an example set prediction system.

FIG. 8 illustrates an example workflow for using a patient classification system to generate predictions characterizing a patient and to make clinical decisions related to the patient.

FIG. 9 shows an example training system.

FIG. 10 illustrates an example of a reconstruction loss.

FIG. 11 illustrates an example of an archetype loss.

FIG. 12 shows an example cluster hardening system.

FIG. 13 provides a visual illustration of applying a clustering operation to a set of embeddings.

FIG. 14 provides a visual illustration of the clustering loss encouraging embeddings to separate into clusters in the latent space.

FIG. 15 provides a visual illustration of the clustering loss encouraging confounding features corresponding to embeddings with different cluster labels to become more entangled in the confounding feature space.

FIG. 16 shows an example conditioning system.

FIG. 17 illustrates an example of operations performed by a conditioning system.

FIG. 18 shows an example response estimation system.

FIG. 19 illustrates an example of computing a drug signature based on gene expression in a cell.

FIG. 20 illustrates examples of response scores for patient categories.

FIG. 21 is a flow diagram of an example process for classifying a patient as being included in a patient category.

FIG. 22 is a flow diagram of an example process for generating a multi-modal data archetype and a corresponding archetype representation for each dimension of a latent space.

FIG. 23 is a flow diagram of an example process for jointly training an encoder neural network and a decoder neural network.

FIG. 24 is a flow diagram of an example process for determining a clustering loss during joint training of an encoder neural network and a decoder neural network.

FIG. 25 is a flow diagram of an example process for conditioning multi-modal data characterizing a target subject based on conditioning data derived from a population of reference subjects.

FIG. 26 is a flow diagram of an example process for generating a clinical recommendation for medical treatment of a patient.

FIG. 27 is a flow diagram of an example process for generating a respective response score for each patient category in a set of patient categories.

FIG. 28 shows an example of 12 multi-modal data archetypes relative to a set of multi-modal features.

FIGS. 29A-B show an example of clustering the patients in the population of patients.

FIGS. 30A-B and FIGS. 31A-B show examples of the distribution of features for patients within clusters.

FIGS. 32A-C show charts that illustrate the extent to which patient clusters identified by the machine learning system differentiate along clinical feature dimensions.

FIGS. 33A-C show charts that illustrate the extent to which patient clusters identified by the machine learning system differentiate along gene expression feature dimensions.

Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

FIG. 1 shows an example machine learning system 100. The machine learning system 100 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The machine learning system 100 processes multi-modal data characterizing patients.

Generally, multi-modal data characterizing a patient includes a respective feature representation for each modality in a set of multiple modalities. A feature representation for a modality refers to a collection of features that collectively represent data from the modality. For convenience, a set of (scalar) features representing multi-modal data can be understood as being indexed by a set of dimensions, referred to as “feature” dimensions.

A few examples of possible modalities, and feature representations for these modalities, are described in more detail next.

In some implementations, multi-modal data characterizing a patient includes data derived from functional magnetic resonance imaging (fMRI) of the brain of the patient. fMRI data can be derived from a sequence of fMRI images, where each fMRI image corresponds to a respective time point in a sequence of time points and characterizes blood flow in the brain at the time point. More specifically, each fMRI image can be represented as an array of voxels, where each voxel is associated with an intensity value that represents blood flow at a corresponding location in the brain.

To generate a feature representation of fMRI data of the brain of the patient, the machine learning system 100 can process the fMRI images to generate a respective blood flow curve for each brain region in a set of brain regions that collectively define a parcellation (i.e., partition) of the brain. The blood flow curve for a brain region can define, for each time point in the sequence of time points, the average blood flow in the brain region at the time point. The machine learning system 100 can compute the average blood flow in a brain region at a time point, e.g., by averaging the intensity values of the voxels in the brain region in the fMRI image for the time point. The machine learning system 100 can process the blood flow curves for the brain regions to generate an N × N “functional connectivity” matrix, where N is the number of regions in the parcellation, and where entry (i,j) of the functional connectivity matrix represents a correlation between the blood flow curves for brain region i and brain region j.
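As an illustration only, the computation of blood flow curves and the functional connectivity matrix can be sketched in Python with NumPy. The function names, the array layout (time, x, y, z), the integer `region_labels` volume that encodes the parcellation, and the choice of Pearson correlation are assumptions made for this sketch; the specification requires only "a correlation" between blood flow curves.

```python
import numpy as np

def blood_flow_curves(fmri, region_labels, num_regions):
    """Average voxel intensity per brain region at each time point.

    fmri: array of shape (T, X, Y, Z), one fMRI image per time point.
    region_labels: integer array of shape (X, Y, Z) mapping each voxel to a
        region index in [0, num_regions) (hypothetical parcellation encoding).
    Returns an array of shape (num_regions, T).
    """
    num_time_points = fmri.shape[0]
    flat = fmri.reshape(num_time_points, -1)
    labels = region_labels.ravel()
    curves = np.empty((num_regions, num_time_points))
    for region in range(num_regions):
        # Average the intensities of the voxels that belong to this region.
        curves[region] = flat[:, labels == region].mean(axis=1)
    return curves

def functional_connectivity(curves):
    """N x N matrix of Pearson correlations between region curves."""
    return np.corrcoef(curves)

# Toy data: 20 time points, a 4 x 4 x 2 voxel grid, 4 regions.
rng = np.random.default_rng(0)
fmri = rng.random((20, 4, 4, 2))
region_labels = np.arange(32).reshape(4, 4, 2) % 4
curves = blood_flow_curves(fmri, region_labels, num_regions=4)
fc = functional_connectivity(curves)
```

With Pearson correlation the resulting matrix is symmetric with ones on the diagonal, so entry (i,j) and entry (j,i) carry the same information.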

A few example techniques for deriving a feature representation of the fMRI data from the functional connectivity matrix are described in more detail next.

In one example, a feature representation of the fMRI data includes the functional connectivity matrix.

In another example, the machine learning system 100 can generate a feature representation of the fMRI data by projecting the functional connectivity matrix onto a vector, where each component of the vector is a combination (e.g., sum or average) of a respective row or column of the functional connectivity matrix.

In another example, to generate a feature representation of the fMRI data, the machine learning system 100 can process the functional connectivity matrix to generate an adjacency matrix that represents a graph. The machine learning system 100 can generate the adjacency matrix, e.g., by setting each value in the functional connectivity matrix that exceeds a predefined threshold to 1, and setting each other value in the functional connectivity matrix to 0. The adjacency matrix represents a graph that includes: (i) a set of nodes, where each node corresponds to a respective brain region, and (ii) a set of edges, where each edge connects a respective pair of nodes in the graph. The adjacency matrix defines which nodes in the graph are connected by edges. In particular, an edge connects node i to node j if the value of entry (i,j) in the adjacency matrix of the graph is 1.

After generating the adjacency matrix representing the graph, the machine learning system 100 can generate a set of graph statistics characterizing the topology of the graph represented by the adjacency matrix, and the set of graph statistics can define the feature representation of the fMRI data. The machine learning system 100 can generate any appropriate graph statistics characterizing the topology of the graph represented by the adjacency matrix, e.g., an average measure of centrality (e.g., degree centrality, or PageRank centrality) of the nodes in the graph, an average size of connected components of the graph (where the size of a connected component of the graph can refer to, e.g., the number of nodes in the connected component of the graph), a diameter of the graph, etc.
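A minimal sketch of the thresholding step and of two of the named graph statistics follows. It assumes a symmetric functional connectivity matrix (so the graph is undirected) and, as a further assumption not stated in the specification, clears the diagonal so that a region is not treated as connected to itself. All function names are hypothetical.

```python
import numpy as np

def adjacency_from_connectivity(fc, threshold):
    """Set entries above the threshold to 1 and all other entries to 0."""
    adj = (fc > threshold).astype(int)
    np.fill_diagonal(adj, 0)  # assumption: ignore self-connections
    return adj

def average_degree_centrality(adj):
    """Mean over nodes of degree / (number of nodes - 1)."""
    n = adj.shape[0]
    return float(np.mean(adj.sum(axis=1) / (n - 1)))

def connected_component_sizes(adj):
    """Number of nodes in each connected component (depth-first search)."""
    n = adj.shape[0]
    seen = np.zeros(n, dtype=bool)
    sizes = []
    for start in range(n):
        if seen[start]:
            continue
        stack, size = [start], 0
        seen[start] = True
        while stack:
            node = stack.pop()
            size += 1
            for neighbor in np.flatnonzero(adj[node]):
                if not seen[neighbor]:
                    seen[neighbor] = True
                    stack.append(neighbor)
        sizes.append(size)
    return sizes

fc = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.9],
    [0.0, 0.1, 0.9, 1.0],
])
adj = adjacency_from_connectivity(fc, threshold=0.5)
component_sizes = connected_component_sizes(adj)
# Feature representation: a vector of graph statistics.
features = np.array([average_degree_centrality(adj), np.mean(component_sizes)])
```

In this toy matrix the threshold leaves two edges (regions 0-1 and regions 2-3), so the graph splits into two components of two nodes each.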

In another example, to generate the feature representation of the fMRI data, the machine learning system 100 can instantiate a graph that includes: (i) a set of nodes, where each node corresponds to a respective brain region, and (ii) a set of edges, where each edge connects a respective pair of nodes in the graph. The graph can be a fully-connected graph, i.e., such that every pair of nodes in the graph is connected by a respective edge in the graph. The machine learning system 100 can further instantiate a respective node embedding for each node in the graph and a respective edge embedding for each edge in the graph. The node embedding for a node can be an embedding (e.g., a one-hot embedding) that identifies the brain region represented by the node. The edge embedding for an edge connecting a pair of nodes representing brain regions indexed by i and j can be an embedding representing the value of entry (i,j) in the functional connectivity matrix. Thus the machine learning system 100 can instantiate the edge embeddings for the edges in the graph using the functional connectivity matrix.

After instantiating the graph, the machine learning system 100 can process data defining the graph (including the node embeddings and the edge embeddings associated with the graph) using a graph neural network to generate a latent representation of the graph that defines the feature representation of the fMRI data. More specifically, at each of one or more time steps, the graph neural network can update the respective node embedding for each node in the graph by processing the current node embeddings and the current edge embeddings in accordance with values of a set of graph neural network parameters. The machine learning system 100 can then combine (e.g., sum or average) the updated node embeddings associated with the nodes in the graph as of the final time step to generate the feature representation of the fMRI data. The graph neural network can have any appropriate graph neural network architecture that enables it to perform its described function. Examples of graph neural network architectures are described with reference to: J. Zhou et al., “Graph neural networks: a review of methods and applications,” AI Open, Volume 1, 2020, pages 57-81.
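The following is one possible sketch of such a graph neural network, not a description of any particular architecture used in practice. It assumes scalar edge embeddings taken directly from the functional connectivity matrix, one-hot node embeddings, a tanh nonlinearity, and two parameter matrices `w_self` and `w_msg` that are randomly initialized here purely for illustration (in the described system they would be trained). All names are hypothetical.

```python
import numpy as np

def message_passing_step(node_emb, edge_emb, w_self, w_msg):
    """Update every node embedding from current node and edge embeddings.

    node_emb: (N, D) current node embeddings.
    edge_emb: (N, N) scalar edge embeddings (assumed form).
    w_self, w_msg: (D, D) graph neural network parameters.
    """
    n = node_emb.shape[0]
    # Each node receives its neighbors' embeddings weighted by the edge values.
    messages = (edge_emb @ node_emb @ w_msg) / n
    return np.tanh(node_emb @ w_self + messages)

def graph_feature_representation(fc, num_steps, w_self, w_msg):
    n = fc.shape[0]
    node_emb = np.eye(n)             # one-hot embedding identifies each region
    edge_emb = fc.copy()
    np.fill_diagonal(edge_emb, 0.0)  # edges connect pairs of distinct nodes
    for _ in range(num_steps):
        node_emb = message_passing_step(node_emb, edge_emb, w_self, w_msg)
    # Combine the final node embeddings by averaging.
    return node_emb.mean(axis=0)

rng = np.random.default_rng(0)
fc = np.corrcoef(rng.random((5, 30)))  # toy 5-region connectivity matrix
w_self = rng.normal(scale=0.1, size=(5, 5))
w_msg = rng.normal(scale=0.1, size=(5, 5))
features = graph_feature_representation(fc, num_steps=2, w_self=w_self, w_msg=w_msg)
```

Averaging the node embeddings makes the length of the feature representation independent of the number of edges, which is one reason a sum or average is a convenient way to combine them.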

Optionally, in addition to generating a “full” functional connectivity matrix representing functional connectivity between each pair of regions in the set of regions defining the parcellation of the brain, the machine learning system 100 can generate one or more “reduced” functional connectivity matrices. Each reduced functional connectivity matrix represents functional connectivity between each pair of regions in a respective proper subset of the set of regions in the parcellation of the brain. That is, each reduced functional connectivity matrix can be represented by an n × n matrix, where n is the number of regions in the corresponding proper subset of the set of regions in the parcellation of the brain, and entry (i,j) of the reduced functional connectivity matrix represents a correlation between the blood flow curves for brain region i and brain region j of the subset.

In some cases, the machine learning system 100 generates one or more reduced functional connectivity matrices that each represent functional connectivity between a respective set of brain regions that are involved in performing a respective biological function in the brain. Examples of biological functions include, e.g., visual data processing, auditory data processing, natural language processing, motor control, etc.

In some cases, the machine learning system 100 generates one or more reduced functional connectivity matrices that each represent functional connectivity between a respective set of brain regions that are anatomically connected in the brain, e.g., that are physically adjacent to one another in the brain.

The machine learning system 100 can generate a respective feature representation of each reduced functional connectivity matrix using any appropriate technique, including any of the techniques described above for generating a feature representation of a full functional connectivity matrix.
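One simple way to obtain a reduced functional connectivity matrix, assuming the full matrix has already been computed, is to select the rows and columns of the regions in the subset. The function name and the example region indices are hypothetical.

```python
import numpy as np

def reduced_connectivity(fc, region_indices):
    """Return the n x n submatrix of fc for a proper subset of regions."""
    idx = np.asarray(region_indices)
    if len(idx) >= fc.shape[0]:
        raise ValueError("region_indices must be a proper subset of the regions")
    return fc[np.ix_(idx, idx)]

# Toy full matrix whose entries encode their own position, for readability.
fc = np.arange(16).reshape(4, 4)
visual_regions = [0, 2]  # hypothetical indices of regions in one functional group
reduced = reduced_connectivity(fc, visual_regions)
```

Because correlations are computed pairwise, selecting a submatrix gives the same values as recomputing the correlations from the blood flow curves of the selected regions.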

In some implementations, multi-modal data characterizing a patient can include clinical scale data obtained from a clinical interview with the patient. Clinical scale data for a patient includes a respective score for the patient in each of multiple categories, where each category is associated with a predefined set of possible scores (e.g., integer values between 1 and 10). Examples of possible categories include, e.g.: apparent sadness, reported sadness, inner tension, reduced sleep, reduced appetite, irritability, aggressiveness, etc. Examples of clinical scales include, e.g.: Positive and Negative Syndrome Scale (PANSS), Brief Assessment of Cognition in Schizophrenia (BACS), Young Mania Rating Scale (YMRS), and Montgomery-Asberg Depression Rating Scale (MADRS). The machine learning system 100 can generate a feature representation of clinical scale data, e.g., as a sequence of embeddings (e.g., one-hot embeddings), where each embedding represents the score for the patient in a respective category.
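For instance, a one-hot feature representation of clinical scale scores could be built as sketched below. The score range of 1 to 10 follows the example in the text; the function name and the sample scores are hypothetical.

```python
import numpy as np

def one_hot_scores(scores, min_score=1, max_score=10):
    """Encode one integer score per category as a one-hot row."""
    scores = np.asarray(scores)
    if scores.min() < min_score or scores.max() > max_score:
        raise ValueError("score outside the predefined set of possible scores")
    num_levels = max_score - min_score + 1
    return np.eye(num_levels, dtype=int)[scores - min_score]

# Hypothetical scores for three categories, e.g., apparent sadness,
# inner tension, and reduced sleep.
embeddings = one_hot_scores([3, 10, 1])
```

Each row of `embeddings` corresponds to one category and contains a single 1 at the position of the patient's score.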

In some implementations, multi-modal data characterizing a patient includes electroencephalography (EEG) data. Generally, EEG data includes a respective voltage waveform measured by each of one or more electrodes that are placed at respective locations in proximity to the brain of the patient. The voltage waveform measured by an electrode includes a respective voltage measurement at the location of the electrode at each time point in a sequence of time points.

The machine learning system 100 can generate a feature representation of EEG data in a variety of possible ways. For example, the machine learning system 100 can generate a feature representation of the EEG data by stacking each of the voltage waveforms into a waveform array, e.g., such that each row or column of the waveform array represents a respective voltage waveform. As another example, the machine learning system 100 can transform each voltage waveform into a different domain, e.g., by applying a Fourier transform to each voltage waveform to transform the voltage waveform into a frequency domain, and then stack the transformed voltage waveforms into a transformed waveform array.
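Both EEG feature representations can be sketched with NumPy as follows. The sketch assumes that all electrodes share the same sampling times, and it keeps only the magnitude of the Fourier coefficients, which is a simplification chosen for this example. Function names and the synthetic waveforms are hypothetical.

```python
import numpy as np

def waveform_array(waveforms):
    """Stack voltage waveforms so that each row is one electrode."""
    return np.stack(waveforms, axis=0)

def frequency_domain_array(waveforms):
    """Apply a real-input FFT to each waveform and stack the magnitudes."""
    return np.abs(np.fft.rfft(waveform_array(waveforms), axis=1))

# Two synthetic electrodes sampled at 100 Hz for one second.
t = np.arange(100) / 100.0
electrode_a = np.sin(2 * np.pi * 5 * t)   # 5 Hz oscillation
electrode_b = np.sin(2 * np.pi * 12 * t)  # 12 Hz oscillation
time_features = waveform_array([electrode_a, electrode_b])
freq_features = frequency_domain_array([electrode_a, electrode_b])
```

With a one-second window the frequency bins are 1 Hz apart, so the dominant bin of each row matches the oscillation frequency of the corresponding electrode.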

In some implementations, multi-modal data characterizing a patient includes genomic data. The machine learning system 100 can represent genomic data in any of a variety of possible formats. A few example techniques for representing genomic data are described in more detail next.

In one example, the machine learning system 100 can represent genomic data as a sequence of nucleotides from the genome of the patient, where each nucleotide includes a respective nucleobase from a set of possible nucleobases (in particular: guanine, adenine, cytosine, and thymine). The machine learning system 100 can generate a feature representation of the genomic data, e.g., as a sequence of embeddings, where each embedding corresponds to a respective nucleotide in the sequence of nucleotides and identifies the nucleobase included in the nucleotide.

In another implementation, the machine learning system 100 can represent genomic data with reference to a predefined set of genes. In particular, the machine learning system 100 can measure a respective degree to which each gene in the predefined set of genes is expressed in the genome of the patient, and the collection of gene expression values can collectively define the genomic data.

In another example, the machine learning system 100 can represent genomic data with reference to a predefined set of locations of interest in the genome of the patient. In particular, the machine learning system 100 can generate a respective representation (e.g., one-hot embedding) identifying the nucleobase included in the nucleotide at each location of interest in the genome of the patient. The representations of the nucleobases at the locations of interest in the genome of the patient can collectively define the genomic data.
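The nucleotide-level representations described above can be illustrated with a short sketch. The ordering of the nucleobases in `NUCLEOBASES`, the function names, and the sample sequence and locations are arbitrary choices for this example.

```python
import numpy as np

NUCLEOBASES = "GACT"  # guanine, adenine, cytosine, thymine (arbitrary order)

def one_hot_nucleotides(sequence):
    """One-hot embedding for each nucleotide in a sequence."""
    index = {base: i for i, base in enumerate(NUCLEOBASES)}
    ids = [index[base] for base in sequence.upper()]
    return np.eye(len(NUCLEOBASES), dtype=int)[ids]

def one_hot_at_locations(sequence, locations):
    """One-hot embeddings restricted to predefined locations of interest."""
    return one_hot_nucleotides("".join(sequence[i] for i in locations))

sequence = "GATTACA"  # toy sequence
full = one_hot_nucleotides(sequence)
subset = one_hot_at_locations(sequence, locations=[0, 5])  # hypothetical loci
```

The first function covers the full-sequence representation; the second keeps only the embeddings at the chosen locations of interest.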

In some implementations, multi-modal data characterizing a patient includes proteomic data, e.g., that characterizes the expression levels of various proteins in the patient. The proteomic data represent, for each protein in a predefined set of proteins, a level of expression of the protein in the patient.

In some implementations, multi-modal data characterizing a patient includes epigenetic data, e.g., that characterizes epigenetic modifications to the genetic material of the patient. Epigenetic modifications are modifications of DNA that can affect gene expression without necessarily altering the DNA sequence. Examples of epigenetic modifications include, e.g., DNA methylation and histone modification. The machine learning system 100 can represent epigenetic data for a patient in any of a variety of possible formats, e.g., by defining a respective rate of occurrence, in the genome of a patient, of each of multiple epigenetic modifications.

In some implementations, multi-modal data characterizing a patient includes transcriptomic data, e.g., that characterizes RNA transcripts that are produced by the genome of a patient. The machine learning system 100 can represent transcriptomic data for a patient in any of a variety of possible formats, e.g., by defining a respective rate of expression, in the patient, of each of multiple RNA transcripts.

In some implementations, multi-modal data characterizing a patient includes demographic data for the patient, e.g., characterizing one or more of: the age, sex, or race of the patient.

In some implementations, multi-modal data characterizing a patient includes characteristics of the family history of the patient, e.g., whether the extended family of the patient includes incidents of disease, e.g., amyotrophic lateral sclerosis (ALS), dementia, Alzheimer’s disease, or frontotemporal disorders (FTD).

In some implementations, multi-modal data characterizing a patient includes data characterizing progression of disease in the patient, e.g., the site of onset of a disease, e.g., the bulbar region of the body, the axial region of the body, or the limbs.

In some implementations, multi-modal data characterizing a patient includes data characterizing a severity of the disease in the patient, e.g., on a predefined staging scale, e.g., the El Escorial Criteria, or the Revised Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS-R).

In some implementations, multi-modal data characterizing a patient can include one or more physiological characteristics of the patient, e.g., the grip strength of the patient or the respiratory function of the patient (e.g., the forced vital capacity and slow vital capacity of the patient).

In some implementations, multi-modal data characterizing a patient includes audio data, e.g., that represents a sequence of words spoken by the patient. The feature representation of the audio data can include, e.g., an audio waveform that includes a respective audio sample at each time point in a sequence of time points, or a spectrogram representation.

In some implementations, multi-modal data characterizing a patient includes video data that shows, e.g., the face of the patient or the entire body of the patient as the patient performs a task, e.g., speaking a sequence of words. The video data can be represented, e.g., as a sequence of video frames, or as a sequence of facial activity vectors. Each facial activity vector can correspond to a respective video frame, and can identify whether the face of the patient in the corresponding video frame is exhibiting each facial activity in a set of possible facial activities, e.g., eyes downcast, eyes turned left, eyes turned right, eyebrows raised, etc.

In some cases, multi-modal data characterizing a patient can include multiple feature representations for certain modalities in the set of modalities (i.e., rather than only a single feature representation for each modality). For example, the multi-modal data can include multiple feature representations corresponding to the fMRI modality, including respective feature representations of a full functional connectivity matrix and one or more reduced functional connectivity matrices, as described above.

In many cases, multi-modal data is collected by a device that measures one or more physical attributes of a patient. Such physical attributes are often indicative of the health of the patient.

In some cases, multi-modal data characterizing a patient can be “longitudinal” multi-modal data. That is, multi-modal data characterizing a patient can include respective multi-modal data captured at each time point in a sequence of multiple time points. Longitudinal multi-modal data can extend across any appropriate time span, e.g., with multi-modal data captured each hour in a sequence of hours, or each day in a sequence of days, or each month in a sequence of months, or each year in a sequence of years, etc.

The machine learning system 100 includes one or more of: an encoder neural network 104, a decoder neural network 108, a training system 900, an archetype generation system 300, a patient clustering system 500, a cluster analysis system 600, a patient classification system 700, and a conditioning system 1600, which will each be described in more detail below.

The encoder neural network 104 is configured to process input multi-modal data 102 characterizing a patient to generate an embedding 106 of the input multi-modal data 102 in a multidimensional latent space, i.e., a space of possible embeddings.

The decoder neural network 108 is configured to process an embedding 106 from the latent space to generate output multi-modal data 110.

Generally, the encoder neural network 104 and the decoder neural network 108 can have any appropriate neural network architectures that enable them to perform their described functions. Example architectures of the encoder neural network 104 and the decoder neural network 108 are described in more detail with reference to FIG. 2.

The training system 900 can jointly train the encoder neural network 104 and the decoder neural network 108 on a set of training data that includes multiple training examples. Each training example corresponds to a respective patient and includes multi-modal data characterizing the patient.

To jointly train the encoder neural network 104 and the decoder neural network 108 on a training example, the training system 900 processes the multi-modal data from the training example using the encoder neural network 104, in accordance with values of a set of encoder neural network parameters, to generate an embedding of the multi-modal data. The training system 900 then processes the embedding of the multi-modal data using the decoder neural network 108, in accordance with values of a set of decoder neural network parameters, to generate multi-modal data that defines a reconstruction (i.e., an estimate) of the multi-modal data from the training example.

The training system 900 then updates the respective values of the encoder neural network parameters and the decoder neural network parameters to optimize an objective function that includes a reconstruction error term. The reconstruction error term measures an error between: (i) the multi-modal data from the training example, and (ii) the reconstruction of the multi-modal data from the training example.
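The encode, decode, and update cycle can be illustrated with a deliberately simplified stand-in: a linear encoder and a linear decoder trained by gradient descent on a mean squared reconstruction error. The linear form, the squared-error measure, the learning rate, and all names are assumptions made for this sketch; the specification leaves the architectures and the exact error measure open.

```python
import numpy as np

def reconstruction_error(x, x_hat):
    """Mean squared error between data and its reconstruction."""
    return float(np.mean((x - x_hat) ** 2))

def train_step(x, w_enc, w_dec, lr):
    """One joint gradient-descent update of encoder and decoder weights."""
    z = x @ w_enc       # embeddings, shape (batch, latent_dim)
    x_hat = z @ w_dec   # reconstructions, shape (batch, feature_dim)
    err = x_hat - x
    scale = 2.0 / err.size
    grad_dec = scale * z.T @ err
    grad_enc = scale * x.T @ (err @ w_dec.T)
    return w_enc - lr * grad_enc, w_dec - lr * grad_dec

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))            # toy "multi-modal" feature vectors
w_enc = 0.1 * rng.normal(size=(8, 3))   # 8 features -> 3 latent dimensions
w_dec = 0.1 * rng.normal(size=(3, 8))
initial_loss = reconstruction_error(x, x @ w_enc @ w_dec)
for _ in range(500):
    w_enc, w_dec = train_step(x, w_enc, w_dec, lr=0.1)
final_loss = reconstruction_error(x, x @ w_enc @ w_dec)
```

Because both sets of weights are updated from the same error signal, the encoder is pushed toward embeddings from which the decoder can recover the input.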

The training encourages the encoder neural network 104 to generate embeddings of multi-modal data that preserve the information content of the multi-modal data, i.e., such that the multi-modal data can be reconstructed by the decoder neural network 108 by processing the embeddings. Generally, an embedding of multi-modal data has a lower dimensionality than the multi-modal data itself, and thus an embedding of multi-modal data provides a compressed representation of the multi-modal data. The embeddings generated by the encoder neural network 104 enable more efficient use of computational resources during processing of multi-modal data by the machine learning system 100. In particular, the embeddings occupy less space than the original multi-modal data when stored in a memory, and downstream processing of the embeddings requires fewer arithmetic operations (e.g., additions and multiplications) than would be required to process the original multi-modal data.

An example of a training system 900 for jointly training the encoder neural network 104 and the decoder neural network 108 is described in more detail with reference to FIG. 9. (Optionally, in implementations where the machine learning system 100 uses a graph neural network to generate feature representations of fMRI data, as described above, the training system 900 can jointly train the graph neural network along with the encoder neural network 104 and the decoder neural network 108).

After the encoder neural network 104 and the decoder neural network 108 have been jointly trained by the training system 900, the encoder neural network 104 and the decoder neural network 108 are provided for use by one or more of: the archetype generation system 300, the patient clustering system 500, the cluster analysis system 600, or the patient classification system 700.

The archetype generation system 300 is configured to generate a set of multi-modal data “archetypes.” In some implementations, each multi-modal data archetype is a collection of multi-modal data of the same form as the multi-modal data provided as an input to the encoder neural network 104 and generated as an output by the decoder neural network 108. However, rather than directly characterizing individual patients, multi-modal data archetypes are exemplars that typify patterns expressed in multi-modal data characterizing a population of patients. In particular, each multi-modal data archetype can “explain” a respective dimension of the latent embedding space by providing a representation of the dimension of the latent embedding space in the space of multi-modal data. Thus the multi-modal data archetypes provide a way of interpreting the latent embedding space, as will be described in more detail below. Examples of archetype generation systems are described in more detail with reference to FIGS. 3A and 3B.

The patient clustering system 500 is configured to perform a clustering operation on a set of embeddings in the latent space representing respective multi-modal data for each patient in a population of patients to identify “clusters” (i.e., groups) of embeddings in the latent space. Each of these clusters represents a patient category, and the clusters define a partition of the population of patients into the patient categories. An example of a patient clustering system 500 is described in more detail below with reference to FIG. 5.

The cluster analysis system 600 is configured to generate a respective “class distribution” for each patient category identified by the patient clustering system 500. The class distribution for a patient category defines, for each class in a set of classes, a fraction (i.e., proportion) of patients included in the patient category that are associated with the class. The classes can include, e.g., one class indicating that a patient is classified as having responded to a medical treatment, and another class indicating that a patient is classified as having not responded to the medical treatment. An example of a cluster analysis system 600 is described in more detail with reference to FIG. 6.
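A class distribution of this kind can be computed by counting, within each patient category, how many patients are associated with each class. The sketch below assumes that each patient has exactly one cluster label and one class label; the function name and the labels are hypothetical.

```python
from collections import Counter, defaultdict

def class_distributions(cluster_labels, class_labels):
    """Fraction of patients of each class within each patient category."""
    counts = defaultdict(Counter)
    for cluster, cls in zip(cluster_labels, class_labels):
        counts[cluster][cls] += 1
    return {
        cluster: {cls: n / sum(c.values()) for cls, n in c.items()}
        for cluster, c in counts.items()
    }

# Five toy patients assigned to two patient categories.
cluster_labels = [0, 0, 0, 1, 1]
class_labels = ["responder", "responder", "non-responder",
                "non-responder", "non-responder"]
distributions = class_distributions(cluster_labels, class_labels)
```

Here two of the three patients in category 0 are responders, while category 1 contains only non-responders, so the two categories have clearly different class distributions.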

The patient classification system 700 is configured to process multi-modal data characterizing a patient to classify the patient as being included in a patient category identified by the patient clustering system 500. The classification of a patient into a patient category can be used in conjunction with the class distribution for the patient category, e.g., as a basis for making inferences about the patient and for making clinical decisions related to medical care for the patient, as will be described in more detail below. An example of a patient classification system 700 is described in more detail with reference to FIG. 7A.

The conditioning system 1600 is configured to preprocess multi-modal data provided to the machine learning system 100, e.g., prior to the multi-modal data being processed by the encoder neural network 104. The conditioning system 1600 can process multi-modal data characterizing a “target” subject by conditioning the multi-modal data characterizing the target subject on conditioning data derived from a population of “reference” subjects. (“Conditioning” multi-modal data based on conditioning data can refer to updating the multi-modal data by combining the conditioning data with the multi-modal data). More specifically, the conditioning system 1600 can enrich the multi-modal data characterizing the target subject based on feature representations corresponding to a reference modality (e.g., a PET or fMRI modality) of the reference subjects. An example of a conditioning system 1600 is described in more detail with reference to FIG. 16.

The response estimation system 1800 is configured to generate a respective response score for each patient category determined by the patient clustering system 500, where the response score for a patient category characterizes a predicted response of patients included in the patient category to receiving a drug. An example of a response estimation system 1800 is described in more detail with reference to FIG. 18.

FIG. 2 shows an example architecture of an encoder neural network 104 and a decoder neural network 108.

The encoder neural network 104 receives input multi-modal data 202 that includes multiple modality feature representations 204-A - 204-N. Each modality feature representation 204-A - 204-N includes a collection of features that collectively represent data from a corresponding modality. Examples of modality feature representations are described with reference to FIG. 1.

The encoder neural network 104 includes multiple encoder subnetworks 206-A - 206-N, where each encoder subnetwork corresponds to a respective modality and is configured to receive as input a feature representation of the corresponding modality. For example, encoder subnetwork 206-A is configured to receive modality feature representation 204-A, and encoder subnetwork 206-N is configured to receive modality feature representation 204-N.

Each encoder subnetwork processes a corresponding modality feature representation to generate a set of parameters that define a probability distribution over the latent space. For example, each encoder subnetwork E_i can generate a mean vector µ_i and a covariance matrix V_i of a Normal distribution over the latent space.

The encoder neural network 104 combines the probability distribution parameters generated by each encoder subnetwork to generate parameters of a “posterior” probability distribution over the latent space. For example, if each encoder subnetwork generates mean and covariance parameters of a Normal distribution, as described above, then the encoder neural network can generate a mean vector µ and a covariance matrix V of the posterior probability distribution as:

$\mu = \left( \sum_{i=0}^{n} \mu_{i} V_{i}^{-1} \right) \left( \sum_{i} V_{i}^{-1} \right)^{-1} \qquad (1)$

$V = \left( \sum_{i=0}^{n} V_{i} \right)^{-1} \qquad (2)$

where µ_0 is a mean vector of a predefined “prior” Normal probability distribution, V_0 is a covariance matrix of the predefined prior Normal distribution, and for each i ∈ {1, ..., n}, µ_i is the mean vector generated by encoder subnetwork i and V_i is the covariance matrix generated by encoder subnetwork i.
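A sketch of this combination step is given below. It follows the usual product-of-Gaussians form, in which the inverse covariance matrices are summed and the posterior mean is weighted by those inverse covariances, consistent with the normalizing factor in equation (1). Treating the posterior covariance as the inverse of the summed inverse covariances is an assumption of this sketch, as are the function name and the column-vector convention (equivalent to the row-vector form of equation (1) when the covariance matrices are symmetric).

```python
import numpy as np

def combine_gaussians(means, covariances):
    """Combine a prior and per-modality Normal distributions.

    means: list of mean vectors; index 0 is the prior.
    covariances: list of covariance matrices, in the same order.
    """
    precisions = [np.linalg.inv(v) for v in covariances]
    total_precision = np.sum(precisions, axis=0)
    posterior_cov = np.linalg.inv(total_precision)
    # Inverse-covariance-weighted sum of the means.
    weighted = np.sum([p @ m for p, m in zip(precisions, means)], axis=0)
    posterior_mean = posterior_cov @ weighted
    return posterior_mean, posterior_cov

prior_mean, prior_cov = np.zeros(2), np.eye(2)
modality_mean, modality_cov = np.array([2.0, 0.0]), np.eye(2)
mu, v = combine_gaussians([prior_mean, modality_mean], [prior_cov, modality_cov])
```

With one modality and identical covariances, the posterior mean lies halfway between the prior mean and the modality mean. A modality that is unavailable for a patient can simply be left out of the two lists.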

The encoder neural network 104 generates the embedding 208 of the input multi-modal data 202 using the posterior probability distribution over the latent space. For example, the encoder neural network 104 can select the embedding of the input multi-modal data as the mean of the posterior probability distribution over the latent space. (During training, the embedding 208 can be sampled from the latent space in accordance with the posterior probability distribution, as will be described in more detail below).

The decoder neural network 108 includes multiple decoder subnetworks 210-A - 210-N. Each decoder subnetwork is configured to process an embedding 208 from the latent space (e.g., an embedding generated by the encoder neural network) to generate a corresponding modality feature representation. For example, decoder subnetwork 210-A is configured to generate modality feature representation 212-A, and decoder subnetwork 210-N is configured to generate modality feature representation 212-N. The collection of modality feature representations generated by the decoder subnetworks collectively define the output multi-modal data 214.

Generally, each of the encoder subnetworks and each of the decoder subnetworks can have any appropriate neural network architecture which enables them to perform their described functions. In particular, each encoder subnetwork and each decoder subnetwork can have any appropriate types of neural network layers (e.g., fully-connected layers, convolutional layers, attention layers, etc.) in any appropriate numbers (e.g., 5 layers, 25 layers, or 50 layers) and connected in any appropriate configuration (e.g., as a linear sequence of layers).

In some cases, the input multi-modal data 202 can be incomplete, i.e., certain modality feature representations can be missing from the input multi-modal data 202. This can occur, e.g., if data from certain modalities were not collected for a patient, or are otherwise unavailable for a patient. In this situation, the encoder neural network 104 can generate an embedding 208 of the input multi-modal data 202 by processing the available modality feature representations using the corresponding encoder subnetworks, and combining the outputs of the encoder subnetworks in accordance with equations (1) - (2). Encoder subnetworks that are configured to process the missing modality feature representations are not used to generate the embedding 208 of the input multi-modal data 202.

The decoder neural network 108 can generate a complete set of multi-modal output data 214, i.e., that includes each modality feature representation, by processing any embedding 208 from the latent space, including embeddings 208 generated by the encoder neural network using incomplete multi-modal input data 202.

FIG. 3A and FIG. 3B show respective example implementations of anarchetype generation system. The archetype generation systems shown inFIG. 3A and FIG. 3B are examples of systems implemented as computerprograms on one or more computers in one or more locations in which thesystems, components, and techniques described below are implemented. Thearchetype generation systems described with reference to FIG. 3A andFIG. 3B can be used in combination, or as alternatives, or otherwiseimplemented and used in any other appropriate fashion for generatingmulti-modal data archetypes and archetype representations, as will bedescribed in more detail below.

FIG. 3A shows an example archetype generation system 300A. The archetype generation system 300A is configured to generate a set of multi-modal data archetypes 304, and for each archetype, a corresponding archetype representation 310. Each multi-modal data archetype 304 corresponds to a respective dimension of the latent space and provides a representation of the dimension of the latent space in the space of multi-modal data. An archetype representation of a multi-modal data archetype provides an interpretable representation of the multi-modal data archetype, as will be described in more detail below.

The archetype generation system 300A generates the multi-modal data archetypes 304 and the archetype representations 310 using a decoder neural network 108 (as described with reference to FIG. 1 and FIG. 2) and a representation engine 306. Prior to being used by the archetype generation system 300A, the decoder neural network 108 is jointly trained, along with an encoder neural network 104, e.g., by the training system 900 described with reference to FIG. 9. The representation engine 306 will be described in more detail below.

The archetype generation system 300A generates the multi-modal data archetypes using a set of “basis” embeddings in the latent space of multi-modal data embeddings that provide a basis of the latent space. A set of embeddings in the latent space is said to provide a basis of the latent space if each possible embedding in the latent space can be uniquely represented as a linear combination of the set of embeddings. That is, a set of embeddings is said to provide a basis if, for each possible embedding, there exists a unique (i.e., exactly one) set of scalar coefficients such that combining the set of embeddings by a linear combination using the set of scalar coefficients yields the possible embedding. Each basis embedding in a set of basis embeddings can be understood to represent a respective dimension of the latent space.

The archetype generation system 300A can generate the multi-modal data archetypes using any appropriate set of basis embeddings in the latent space. For example, the set of basis embeddings can be given by the set of “unit” embeddings in the latent space. A unit embedding in the latent space refers to an embedding in the latent space where one position in the embedding has value 1, and the other positions in the embedding have value 0. As another example, the set of basis embeddings can be given by a set of embeddings obtained by scaling the set of unit embeddings by a non-zero value. As another example, the set of basis embeddings can be given by a set of embeddings obtained by rotating the set of unit embeddings in the latent space by any non-zero angle along any axis in the latent space.
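As an illustrative, non-limiting sketch, the three example sets of basis embeddings described above can be constructed and checked as follows. The function names (`unit_basis`, `scaled_basis`, `rotated_basis`, `is_basis`) and the use of NumPy are assumptions made for illustration only; the rotation is shown as a rotation in the plane spanned by two chosen coordinate axes, which is one possible reading of rotating "along any axis."

```python
import numpy as np

def unit_basis(dim):
    """Rows are the unit embeddings: one position is 1, the rest are 0."""
    return np.eye(dim)

def scaled_basis(dim, scale):
    """Unit embeddings scaled by a non-zero value."""
    if scale == 0:
        raise ValueError("scale must be non-zero")
    return scale * np.eye(dim)

def rotated_basis(dim, angle, axis_i=0, axis_j=1):
    """Unit embeddings rotated by `angle` radians in the (axis_i, axis_j) plane."""
    rot = np.eye(dim)
    c, s = np.cos(angle), np.sin(angle)
    rot[axis_i, axis_i] = c
    rot[axis_i, axis_j] = -s
    rot[axis_j, axis_i] = s
    rot[axis_j, axis_j] = c
    # Row k of the result is the rotation applied to the k-th unit embedding.
    return np.eye(dim) @ rot.T

def is_basis(embeddings):
    """A set of d embeddings is a basis of a d-dimensional space iff it has full rank."""
    e = np.asarray(embeddings, dtype=float)
    return e.shape[0] == e.shape[1] and np.linalg.matrix_rank(e) == e.shape[0]
```

In this sketch each row of the returned array is one basis embedding, so a latent space of dimension `dim` yields `dim` basis embeddings.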

The set of basis embeddings 302 can be a predefined set of basis embeddings, or a set of basis embeddings that are provided to the archetype generation system 300A, e.g., by a user of the archetype generation system.

The archetype generation system 300A processes each basis embedding 302 using the decoder neural network 108 to generate multi-modal data that defines a corresponding multi-modal data archetype 304. The multi-modal data archetypes are exemplars that typify patterns expressed in multi-modal data characterizing a population of patients, e.g., the population of patients that provided the multi-modal data used for training the decoder neural network 108. In particular, each multi-modal data archetype provides a representation of a corresponding dimension of the latent space in the space of multi-modal data.

Multi-modal data characterizing any patient can be represented as an embedding in the latent space, i.e., by processing the multi-modal data characterizing the patient using an encoder neural network. The embedding of the multi-modal data characterizing the patient can be uniquely expressed as a combination of the basis embeddings in the latent space. Thus the basis embeddings in the latent space provide a set of “latent archetypes” that can be used to represent an embedding of multi-modal data characterizing any patient. The multi-modal data archetypes provide a representation of the latent archetypes in the space of multi-modal data.

An illustration of the concept of archetypes is provided in FIGS. 4. The shape of a human face is generally a combination of one or more underlying shapes, e.g., oblong, oval, round, rectangular, etc. For example, the shape of one face might be a combination of oblong and oval shapes, while the shape of another face might be a combination of diamond and inverted triangle shapes.

Each of the underlying shapes can be understood as being a face shape archetype, i.e., an exemplar that typifies patterns expressed in the shapes of human faces. Similarly, each of the multi-modal data archetypes typifies patterns expressed in multi-modal data characterizing patients. It can be appreciated that the face shape archetypes illustrated in FIG. 4A provide a way of interpreting the distribution of human face shapes. The multi-modal data archetypes 304 similarly provide a way of interpreting the distribution of multi-modal data characterizing patients, and in particular, of interpreting the multi-modal data represented by each dimension of the latent space.

However, in contrast to the face shape archetypes illustrated in FIG. 4A, which accommodate immediate interpretation by visual inspection, the multi-modal data archetypes 304 are generally high-dimensional collections of modality feature representations that can be significantly more challenging to interpret. Thus while the multi-modal data archetypes provide a way of interpreting the dimensions of the latent space, there is a need for a way of interpreting the multi-modal data archetypes themselves.

To this end, the archetype generation system 300A uses the representation engine 306 to generate a respective archetype representation 310 for each multi-modal data archetype 304, as will be described in more detail next.

Generally, multi-modal data (including multi-modal data archetypes and multi-modal data characterizing patients) can be understood as being represented by a set of feature representations that each include a respective collection of features. For convenience, the features representing multi-modal data can be understood as being organized into a set of feature dimensions, where each feature dimension is associated with a value of a corresponding feature representing the multi-modal data, as described above.

The archetype representation 310 for a multi-modal data archetype 304 includes a respective “intensity score,” represented as a numerical value, corresponding to each feature dimension of the multi-modal data archetype 304. To determine the intensity score for a feature dimension of a multi-modal data archetype, the representation engine 306 identifies a respective value of the feature dimension in multi-modal data included in each training example in a set of training examples 308. The values of the feature dimension in the multi-modal data of the training examples collectively define a distribution of values of the feature dimension. The representation engine 306 then determines the intensity score for the feature dimension based on: (i) the value of the feature dimension of the multi-modal data archetype, and (ii) the distribution defined by values of the feature dimension in the multi-modal data of the training examples.

Generally, the intensity score for a feature dimension of a multi-modal data archetype can characterize a likelihood of the value of the feature dimension of the multi-modal data archetype under the distribution defined by values of the feature dimension in the multi-modal data of the training examples. For example, the representation engine 306 can determine the intensity score for a feature dimension of a multi-modal data archetype as:

$\begin{matrix}{z = \frac{x - \mu}{\sigma}} & \text{(3)}\end{matrix}$

where z is the intensity score for the feature dimension, x is the value of the feature dimension in the multi-modal data archetype, µ is the average of the distribution defined by values of the feature dimension in the multi-modal data of the training examples, and σ is the standard deviation of the distribution defined by values of the feature dimension in the multi-modal data of the training examples. (In this example, a higher absolute value of the intensity score represents a lower likelihood of the value of the feature dimension of the multi-modal data archetype under the distribution defined by values of the feature dimension in the multi-modal data of the training examples).
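As an illustrative sketch of equation (3), the intensity scores for all feature dimensions of one archetype can be computed in a single vectorized step. The function name `intensity_scores`, the array layout (one training example per row), the use of the population standard deviation, and the handling of constant feature dimensions are assumptions made for this example and are not specified above.

```python
import numpy as np

def intensity_scores(archetype, training_data):
    """Per-feature intensity score z = (x - mu) / sigma, as in equation (3).

    archetype:     shape (num_features,), feature values x of the archetype.
    training_data: shape (num_examples, num_features), one training example per row.
    """
    archetype = np.asarray(archetype, dtype=float)
    training_data = np.asarray(training_data, dtype=float)
    mu = training_data.mean(axis=0)    # average of each feature dimension
    sigma = training_data.std(axis=0)  # population standard deviation (assumption)
    # Assumption: a feature dimension with zero spread is given a score of 0
    # rather than dividing by zero.
    safe_sigma = np.where(sigma > 0, sigma, 1.0)
    return np.where(sigma > 0, (archetype - mu) / safe_sigma, 0.0)
```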

The set of training examples 308 can be the same set of training examples that are used to train the decoder neural network jointly with the encoder neural network, e.g., by the training system described with reference to FIG. 9. Alternatively, some or all of the training examples 308 can be “held out” training examples that include multi-modal data that was not used during training of the decoder neural network and the encoder neural network.

The archetype representation 310 for a multi-modal data archetype 304 facilitates interpretation of the multi-modal data archetype by indicating which features in the multi-modal data archetype have values that differ most significantly from the “expected” feature values across the multi-modal data of the training examples. Particularly if the number of feature dimensions is very large (e.g., in the thousands), the archetype representation for a multi-modal data archetype can enable a user to rapidly identify the feature dimensions that best explain the multi-modal data archetype, e.g., the feature dimensions having the highest intensity scores.
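One possible way of surfacing the most explanatory feature dimensions is sketched below. The function name `top_feature_dimensions` and the optional `feature_names` argument are hypothetical, and ranking by the absolute value of the intensity score (so that strongly negative scores are also surfaced) is an assumption of this sketch.

```python
import numpy as np

def top_feature_dimensions(scores, k, feature_names=None):
    """Return the k feature dimensions whose intensity scores have the largest magnitude.

    Each result is (feature index or name, signed intensity score).
    """
    scores = np.asarray(scores, dtype=float)
    # Sort by descending magnitude; a stable sort keeps ties in index order.
    order = np.argsort(-np.abs(scores), kind="stable")[:k]
    if feature_names is None:
        return [(int(i), float(scores[i])) for i in order]
    return [(feature_names[i], float(scores[i])) for i in order]
```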

The archetype generation system 300A can make the multi-modal data archetypes 304 and the archetype representations 310 available to a user of the archetype generation system 300A in any of a variety of possible ways. For example, the archetype generation system 300A can illustrate the archetype representations 310 to a user in a visual format, e.g., as shown in FIG. 4B, where the intensity score for each of multiple feature dimensions of a multi-modal data archetype is represented by a shade of color. In this example, higher intensity scores are represented by darker shades, and lower intensity scores are represented by lighter shades.

In addition to providing a way of interpreting the dimensions of the latent space, the multi-modal data archetypes can further provide a mechanism for interpreting multi-modal data characterizing individual patients.

For example, to facilitate interpretation of input multi-modal data characterizing a patient, the archetype generation system 300A can process the input multi-modal data using the encoder neural network to generate an embedding of the input multi-modal data. The archetype generation system 300A can then determine a respective coefficient (i.e., numerical value) for each basis embedding in a set of basis embeddings in the latent space such that linearly combining the basis embeddings in accordance with the coefficients yields the embedding of the input multi-modal data. The archetype generation system 300A can then provide an output that identifies: (i) each multi-modal data archetype, and (ii) for each multi-modal data archetype, the value of the coefficient of the corresponding basis embedding in the latent space. The values of the coefficients can enable a user to interpret the contribution of each multi-modal data archetype to the input multi-modal data.
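As a minimal sketch of the coefficient determination described above, the coefficients can be obtained by solving a linear system, which has exactly one solution when the embeddings form a basis. The function name `basis_coefficients` and the row-per-embedding layout are assumptions made for illustration.

```python
import numpy as np

def basis_coefficients(embedding, basis):
    """Solve for the unique coefficients c with sum_i c[i] * basis[i] == embedding.

    embedding: shape (d,), the embedding of the input multi-modal data.
    basis:     shape (d, d), one basis embedding per row.
    """
    basis = np.asarray(basis, dtype=float)
    embedding = np.asarray(embedding, dtype=float)
    # sum_i c[i] * basis[i] == embedding  is equivalent to  basis.T @ c == embedding.
    return np.linalg.solve(basis.T, embedding)
```

For the unit embeddings, the coefficients equal the entries of the embedding itself; for a scaled or rotated basis they differ, which is why solving the system is shown explicitly.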

The archetype generation system 300A can make the multi-modal data archetypes 304 and the archetype representations 310 available to users, e.g., through a user interface, e.g., a graphical user interface (GUI).

FIG. 3B shows an example archetype generation system 300B. The archetype generation system 300B can be used as an alternative to, or in combination with, the archetype generation system 300A described above with reference to FIG. 3A.

The archetype generation system 300B is configured to generate a set of multi-modal data archetypes 322, and for each multi-modal data archetype 322, a corresponding archetype representation 324. The multi-modal data archetypes 322 are exemplars that typify patterns expressed in multi-modal data characterizing a population of patients. The number of multi-modal data archetypes 322 can be significantly less than the number of patients in the population, e.g., by one or more orders of magnitude. The multi-modal data archetypes thus provide an efficient way of representing patterns expressed in multi-modal data characterizing the population of patients. An illustration of the concept of archetypes is provided in FIGS. 4, and is described in more detail above. An archetype representation of a multi-modal data archetype provides an interpretable representation of the multi-modal data archetype, as described in more detail above with reference to FIG. 3A.

The archetype generation system 300B generates the set of multi-modal data archetypes 322 using an encoder neural network 104, a region generation engine 316, and a selection engine 320, which are each described in more detail next.

The encoder neural network 104 is configured to process multi-modal data characterizing a patient to generate an embedding of the multi-modal data in a latent space. An example architecture of the encoder neural network is described above with reference to FIG. 2. Prior to being used by the archetype generation system 300B, the encoder neural network 104 is jointly trained, along with a decoder neural network 108, e.g., by the training system 900 described with reference to FIG. 9.

For each patient in a population of patients, the archetype generation system 300B processes multi-modal data 312 characterizing the patient using the encoder neural network 104 to generate a corresponding embedding in the latent space. The embeddings of the multi-modal data characterizing the patients in the population of patients collectively form a set of embeddings 314. The set of embeddings 314 includes a respective embedding for each patient in the population of patients.

The population of patients can include any appropriate number of patients, e.g., 1000 patients, 10,000 patients, or 100,000 patients. Patients can be selected for inclusion in the population of patients based on any appropriate selection criteria, e.g., selection criteria based on patient demographics, e.g., age, gender, etc. In some instances, the population of patients may be candidates for inclusion in a clinical trial for a therapy, e.g., a drug.

The region generation engine 316 processes the set of embeddings 314 to generate a set of region parameters 318 that define a region of the latent space. The region parameters 318 can define a region of the latent space that encloses the set of embeddings 314, e.g., such that each embedding in the set of embeddings 314 is included in the region defined by the region parameters 318. The region parameters can be represented as an ordered collection of numerical values, e.g., a vector, matrix, or other tensor of numerical values.

The region generation engine 316 can be configured to generate region parameters 318 that define a region of the latent space in any appropriate way. A few example techniques by which the region generation engine 316 can generate region parameters 318 defining a region of the latent space are described next.

In one example, the region generation engine 316 can generate region parameters 318 that define an (approximation of a) convex hull of the set of embeddings in the latent space. The “convex hull” of a set of embeddings can refer to a convex set, e.g., the minimal convex set, that contains each embedding in the set of embeddings. A set can be referred to as “convex” if, for any elements v₁, ..., v_(K) included in the set, and for any non-negative scalar coefficients α₁, ..., α_(K) that sum to 1, an element defined by:

$v = {\sum\limits_{k = 1}^{K}{\alpha_{k} \cdot v_{k}}}$

is also included in the set. Intuitively, any two elements in a convex set can be joined by a line that lies entirely within the convex set. A convex set that contains a set of embeddings can be referred to as the “minimal” convex set containing the set of embeddings if any convex set containing the set of embeddings necessarily includes the minimal convex set. The convex hull of the set of embeddings 314 can be a convex polytope, i.e., a convex region with flat sides. An example of a convex hull of a set of embeddings in three-dimensional (3D) space is illustrated with reference to FIG. 4C and described in more detail below.

The region generation engine 316 can generate region parameters 318 that define the convex hull of the set of embeddings in the latent space using any appropriate numerical technique, e.g., the “quickhull” technique described with reference to: C. Barber et al., “The quickhull algorithm for convex hulls,” ACM Transactions on Mathematical Software, Volume 22, Issue 4, December 1996, pp. 469-483.
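As an illustrative sketch, region parameters for a convex hull can be computed with the `scipy.spatial.ConvexHull` class, which wraps the Qhull implementation of the quickhull technique. The function names `convex_hull_region` and `contains`, and the dictionary layout of the returned region parameters, are assumptions made for this example. Exact convex hull computation becomes expensive as the dimension of the latent space grows, so a practical system may rely on an approximation or on a low-dimensional latent space.

```python
import numpy as np
from scipy.spatial import ConvexHull

def convex_hull_region(embeddings):
    """Compute example region parameters for the convex hull of a set of embeddings.

    embeddings: shape (num_embeddings, d), one embedding per row.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    hull = ConvexHull(embeddings)
    return {
        # Indices (into `embeddings`) of the embeddings that are hull vertices.
        "vertex_indices": hull.vertices,
        # The vertex embeddings themselves.
        "vertex_embeddings": embeddings[hull.vertices],
        # One row [normal, offset] per face; normal @ x + offset <= 0 inside the hull.
        "face_equations": hull.equations,
    }

def contains(region, point, tol=1e-9):
    """Check whether a point lies inside (or on the boundary of) the region."""
    eq = region["face_equations"]
    point = np.asarray(point, dtype=float)
    return bool(np.all(eq[:, :-1] @ point + eq[:, -1] <= tol))
```

The two representations returned here correspond to the vertex-based and face-based region parameters discussed in this specification.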

Region parameters 318 defining the convex hull of the set of embeddings 314 can be represented in any appropriate way. For instance, the region parameters 318 can define a set of vertex embeddings, where each vertex embedding represents a position of a respective vertex (e.g., corner) of the convex hull of the set of embeddings 314. As another example, the set of region parameters 318 can define a set of planar surfaces, where each planar surface represents a respective face of a convex polytope defining the convex hull of the set of embeddings 314. The region parameters 318 can define a planar surface, e.g., by defining an embedding orthogonal to the planar surface and an embedding positioned on the planar surface.

In another example, the region generation engine 316 can generate region parameters 318 that define an (approximation of a) concave hull of the set of embeddings in the latent space. The region generation engine 316 can generate region parameters 318 that define the concave hull of the set of embeddings in the latent space using any appropriate numerical technique, e.g., the techniques described with reference to: A. Moreira et al., “Concave hull: a k-nearest neighbors approach for the computation of the region occupied by a set of points,” International Conference on Computer Graphics Theory and Applications, 2007, pp. 61-68.

In another example, the region generation engine 316 can generate region parameters 318 that define an (approximation of an) alpha shape for the set of embeddings in the latent space. The region generation engine 316 can generate region parameters 318 that define an alpha shape of the set of embeddings in the latent space using any appropriate numerical technique, e.g., the techniques described with reference to: H. Edelsbrunner, “Alpha shapes - a survey,” in: van de Weygaert R, Vegter G, Ritzerveld J, Icke V, eds. Tessellations in the Sciences: Virtues, Techniques and Applications of Geometric Tilings. Springer, 2011.

The selection engine 320 is configured to generate the set of multi-modal data archetypes 322 based on: (i) the set of embeddings 314, and (ii) the region parameters 318. In particular, the selection engine 320 can select a proper subset of the set of embeddings 314 as being “archetype” embeddings based on their proximity to vertices of the region defined by the region parameters 318. For each archetype embedding, the selection engine 320 can identify the multi-modal data represented by the archetype embedding as being a multi-modal data archetype 322.

More specifically, to identify the archetype embeddings, the selection engine 320 can determine a set of vertices of the region of the latent space defined by the region parameters 318. (Each vertex can be represented as a point in the latent space, where each point in the latent space can be represented as an ordered collection of numerical values, e.g., a vector or other tensor of numerical values). For instance, the region parameters 318 can directly define the vertices of the region (as described above), and the selection engine 320 can thus directly identify the vertices of the region from the values of the region parameters 318. As another example, the region parameters 318 can define a set of planar surfaces representing faces of a convex polytope region, and the selection engine 320 can identify points on the surface of the convex polytope where multiple faces converge to a point as vertices of the region.

After identifying the vertices of the region of the latent space defined by the region parameters 318, the selection engine 320 can identify a respective multi-modal data archetype corresponding to each of the vertices of the region. In particular, for each vertex, the selection engine 320 can designate an embedding from the set of embeddings 314 that has a minimum distance to the vertex from among the embeddings in the set of embeddings 314 as being an archetype embedding. The selection engine 320 can then identify the multi-modal data represented by the archetype embedding as being a multi-modal data archetype 322. The multi-modal data represented by an archetype embedding can refer to the multi-modal data processed by the encoder neural network 104 to generate the archetype embedding.

The selection engine 320 can measure distances between embeddings and vertices in the latent space using any appropriate distance measure, e.g., a Euclidean distance or an L₁ distance.

In some cases, an embedding from the set of embeddings 314 may exactly match a vertex of the region, and the selection engine 320 can identify the embedding matching the vertex as being an archetype embedding. In some cases, none of the embeddings from the set of embeddings 314 match a given vertex, and to identify the archetype embedding corresponding to the vertex, the selection engine 320 can compute a respective distance from the vertex to each embedding in the set of embeddings 314. The selection engine 320 can then identify an embedding having the minimum distance to the vertex from among the set of embeddings 314 as being the archetype embedding for the vertex.
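The selection of archetype embeddings by minimum distance to each vertex can be sketched as follows. The function name `select_archetype_indices` and the `ord` parameter (2 for a Euclidean distance, 1 for an L₁ distance) are assumptions made for illustration. An embedding that exactly matches a vertex has distance zero and is therefore selected by the same computation.

```python
import numpy as np

def select_archetype_indices(embeddings, vertices, ord=2):
    """For each vertex, return the index of the embedding closest to that vertex.

    embeddings: shape (n, d), one embedding per patient.
    vertices:   shape (m, d), vertices of the region enclosing the embeddings.
    ord:        2 for Euclidean distance, 1 for L1 distance.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    vertices = np.asarray(vertices, dtype=float)
    # Pairwise differences between every vertex and every embedding: (m, n, d).
    diffs = vertices[:, None, :] - embeddings[None, :, :]
    distances = np.linalg.norm(diffs, ord=ord, axis=2)  # shape (m, n)
    return distances.argmin(axis=1)
```

The multi-modal data archetype for a vertex is then the multi-modal data that was encoded to produce the selected embedding, which can be looked up using the returned index.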

The multi-modal data archetypes 322 generated by the archetype generation system 300B provide a way of interpreting the distribution of multi-modal data characterizing patients in the population of patients, and in particular, of efficiently capturing typical patterns expressed in the distribution of multi-modal data across the population of patients. More specifically, in the latent space, each embedding in the set of embeddings 314 can be represented as a combination (e.g., a linear combination) of the vertices of the region enclosing the set of embeddings 314, and each vertex can be represented (approximately or exactly) by an archetype embedding. Thus, the multi-modal data for each patient in the population of patients can be understood as a combination of multi-modal data archetypes corresponding to the archetype embeddings.

To facilitate interpretation of input multi-modal data characterizing a patient, the archetype generation system 300B can process the input multi-modal data using the encoder neural network 104 to generate an embedding of the input multi-modal data. The archetype generation system 300B can then determine a respective coefficient (i.e., numerical value) for each archetype embedding such that linearly combining the archetype embeddings in accordance with the coefficients yields the embedding of the input multi-modal data. The archetype generation system 300B can then provide an output that identifies: (i) each multi-modal data archetype 322, and (ii) for each multi-modal data archetype, the value of the coefficient of the corresponding archetype embedding in the latent space. The values of the coefficients can enable a user to interpret the contribution of each multi-modal data archetype to the input multi-modal data.
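Because the number of archetype embeddings need not equal the dimension of the latent space, the coefficients are not necessarily unique. One possible approach, sketched below under that assumption, is a least-squares fit, which returns the minimum-norm coefficients when several exact solutions exist and reports a residual when none exists. The function name `archetype_coefficients` is hypothetical, and the use of least squares is a choice made for this example rather than a technique stated above.

```python
import numpy as np

def archetype_coefficients(embedding, archetype_embeddings):
    """Fit coefficients c so that sum_i c[i] * archetype_embeddings[i] ~= embedding.

    Returns (coefficients, residual norm). A residual near zero means the
    linear combination reproduces the embedding.
    """
    a = np.asarray(archetype_embeddings, dtype=float)  # shape (m, d)
    e = np.asarray(embedding, dtype=float)             # shape (d,)
    coeffs, *_ = np.linalg.lstsq(a.T, e, rcond=None)
    residual = float(np.linalg.norm(a.T @ coeffs - e))
    return coeffs, residual
```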

The multi-modal data archetypes 322 are generally high-dimensional collections of modality feature representations that can be challenging to visually interpret. To address this issue, the archetype generation system 300B uses the representation engine 306 to generate a respective archetype representation 324 for each multi-modal data archetype 322. Example techniques for generating archetype representations 324 using a representation engine 306 are described in more detail above with reference to FIG. 3A.

FIG. 4C illustrates an example of generating multi-modal data archetypes using the archetype generation system 300B described above with reference to FIG. 3B. For each patient in a population of patients, the archetype generation system 300B obtains multi-modal data 312 characterizing the patient, and processes the multi-modal data 312 characterizing the patient using an encoder neural network to generate a corresponding embedding in a latent space 402. The archetype generation system 300B thus generates a set of multi-modal data embeddings in the latent space, i.e., where the set of embeddings includes a respective multi-modal data embedding for each patient in the population of patients.

The archetype generation system 300B processes the set of embeddings to generate region parameters defining a region 404 of the latent space that encloses the set of embeddings, e.g., a convex hull of the set of embeddings. The archetype generation system 300B can identify a set of vertices of the region 404 (e.g., the vertex 406), and identify a respective “archetype” embedding corresponding to each vertex (e.g., the embedding 408), e.g., as the embedding that has minimum distance to the vertex (from among the embeddings in the set of embeddings). For each archetype embedding, the archetype generation system 300B can identify the multi-modal data represented by the archetype embedding as being a multi-modal data archetype. The archetype generation system 300B can generate a respective multi-modal data archetype representation 410 corresponding to each multi-modal data archetype, e.g., as an interpretable representation of the multi-modal data archetype, e.g., using the techniques described above with reference to FIG. 3A.

FIG. 5 shows an example patient clustering system 500. The patient clustering system 500 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The patient clustering system 500 processes a set of training examples 502 that each include multi-modal data characterizing a respective patient from a population of patients to determine a set of patient categories 508. The patient clustering system 500 also assigns each patient from the population of patients to a respective patient category 508.

The patient clustering system 500 determines the patient categories 508 and the assignment of patients to respective patient categories 508 using an encoder neural network 104 (as described with reference to FIG. 1 and FIG. 2) and a clustering engine 506. Prior to being used by the patient clustering system 500, the encoder neural network 104 is jointly trained, along with a decoder neural network, e.g., by the training system described with reference to FIG. 9. The clustering engine 506 will be described in more detail below.

The patient clustering system 500 processes the multi-modal data included in each training example using the encoder neural network 104 to generate an embedding 504 of the multi-modal data included in the training example 502.

The patient clustering system 500 then provides the embeddings 504 of the multi-modal data from the training examples 502 to the clustering engine 506. The clustering engine 506 performs a clustering operation on the embeddings 504 to generate a partition of the set of embeddings into multiple groups, referred to as “clusters,” that each include multiple embeddings 504.

Generally, the clustering engine 506 performs a clustering operation that encourages the embeddings in the same cluster to be more similar (according to some similarity measure in the latent space) than embeddings in different clusters. The clustering engine 506 can cluster the embeddings 504 using any appropriate clustering operation, e.g., a k-means clustering operation, an expectation maximization clustering operation, a hierarchical agglomerative clustering operation, or a spectral clustering operation. The number of clusters generated by the clustering engine 506 can be, e.g., a predefined hyper-parameter that is specified by a user of the patient clustering system, or determined dynamically by the clustering engine 506 during clustering.

In some implementations, prior to performing the clustering operation on the embeddings 504, the patient clustering system 500 can apply a projection operation to each embedding 504 to remove one or more specified dimensions of the embedding 504. Thus, in these implementations, the clustering engine 506 clusters projected embeddings 504 having fewer dimensions than the original embeddings 504 generated by the encoder neural network 104.
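A projection operation of this kind can be sketched as dropping the specified coordinates from every embedding. The function name `remove_dimensions` and the row-per-embedding layout are assumptions made for illustration.

```python
import numpy as np

def remove_dimensions(embeddings, dims_to_remove):
    """Project embeddings onto the latent dimensions that are not removed.

    embeddings:     shape (n, d), one embedding per row.
    dims_to_remove: iterable of zero-based dimension indices to drop.
    Returns an array of shape (n, d - number of removed dimensions).
    """
    embeddings = np.asarray(embeddings, dtype=float)
    return np.delete(embeddings, list(dims_to_remove), axis=1)
```

Because the projected embeddings have fewer columns, both the memory needed to store them and the cost of each distance computation during clustering are reduced.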

The dimensions to be removed from the embeddings 504 can be specified, e.g., by a user of the patient clustering system 500, through a user interface (e.g., a graphical user interface) made available to the user on a user device.

For example, the patient clustering system 500 can use the archetype generation system 300 described with reference to FIGS. 3 to generate, for each dimension of the latent space, a respective multi-modal data archetype representing the dimension of the latent space. The patient clustering system 500 can provide the multi-modal data archetypes (and/or interpretable archetype representations of the multi-modal data archetypes, as described with reference to FIGS. 3) to the user, through the user interface, for use by a user assessing which (if any) dimensions should be removed from the embeddings 504. A user may determine that a dimension of the embeddings should be removed, e.g., if the multi-modal data archetype for the dimension defines multi-modal data that the user identifies as being substantially irrelevant to a medical condition of interest. The user can provide an input, through the user interface, specifying one or more dimensions to be removed from the embeddings 504. In response to receiving the input from the user, the patient clustering system 500 can remove the specified dimensions from the embeddings 504.

In some implementations, the machine learning system can perform an automated process to determine that one or more dimensions can be removed from the embeddings 504 (and, more generally, the latent space). For instance, for each dimension of the latent space, the machine learning system can process the multi-modal data archetype corresponding to the dimension of the latent space to determine whether a criterion for removal is satisfied. In response to determining that the criterion for removal is satisfied, the machine learning system can remove the dimension from the latent space, and in particular, can remove the dimension from each of the embeddings in the latent space. The machine learning system can implement any appropriate criterion for removal of a dimension of the latent space. For instance, the machine learning system can determine that a dimension satisfies a criterion for removal from the latent space if a feature dimension of the corresponding multi-modal data archetype satisfies a threshold. Generally, the machine learning system can implement appropriate criteria that result in the removal of dimensions of the latent space that are predicted to be substantially irrelevant to a medical condition. Removing a dimension from the latent space can refer to applying projection operations to remove the dimension from embeddings generated by the encoder neural network. Removing dimensions from the latent space can result in reduced consumption of computational resources, e.g., by reducing the memory requirements to store embeddings in the latent space, and by reducing compute requirements for clustering embeddings in the latent space.

Applying a projection operation to remove one or more specified dimensions of the embeddings 504 can reduce consumption of computational resources, e.g., memory and computing power, during clustering. Removing dimensions of the embeddings 504 corresponding to multi-modal data archetypes that are identified as being substantially irrelevant to a medical condition of interest can also increase the relevance of the clusters to the medical condition. For example, removing dimensions identified as being substantially irrelevant to a medical condition can increase the likelihood that embeddings in the same cluster correspond to patients that share characteristics relevant to the medical condition.

In some implementations, to cluster the embeddings 504, the clustering engine 506 can designate certain embeddings 504 as being “archetype” embeddings, where each archetype embedding represents a respective cluster. For each embedding 504, the clustering engine 506 can determine a respective distance between the embedding 504 and each archetype embedding, and then assign the embedding 504 to the cluster represented by the archetype embedding having minimum distance to the embedding 504. The clustering engine 506 can thus partition the set of embeddings 504 into a number of clusters equal to the number of archetype embeddings, where each archetype embedding represents a respective cluster, and where each embedding is assigned to the cluster represented by the archetype embedding having minimum distance from the embedding. The clustering engine 506 can measure distances between embeddings in the latent space using any appropriate distance measure, e.g., a Euclidean distance measure or an L₁ distance measure. The number of embeddings 504 designated as being archetype embeddings can be significantly smaller than the total number of embeddings 504, e.g., by one or more orders of magnitude.
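The nearest-archetype assignment described above can be sketched as follows. This is an illustrative example only: the function name and the toy embeddings are assumptions, and a Euclidean distance is used as one of the distance measures mentioned in the text.

```python
import math


def assign_to_archetypes(embeddings, archetype_indices):
    """Assign each embedding to the cluster of its nearest archetype embedding.

    archetype_indices lists the positions (within embeddings) of the
    embeddings designated as archetypes. Returns, for each embedding,
    the index of its cluster.
    """
    archetypes = [embeddings[i] for i in archetype_indices]
    assignments = []
    for e in embeddings:
        # Euclidean distance from this embedding to every archetype.
        distances = [math.dist(e, a) for a in archetypes]
        assignments.append(distances.index(min(distances)))
    return assignments


embeddings = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [9.0, 10.0]]
clusters = assign_to_archetypes(embeddings, archetype_indices=[0, 2])
```

Because each assignment depends only on one embedding and the fixed archetypes, the loop body could be run independently for every embedding.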

The clustering engine 506 can determine that an embedding 504 should be designated as an archetype embedding using any appropriate criteria. An example process for identifying archetype embeddings is described in more detail with reference to FIG. 3B. Briefly, the clustering engine can identify a region of the latent space that encloses the set of embeddings (e.g., a convex hull of the set of embeddings), identify a set of vertices of the region, and determine a respective archetype embedding corresponding to each vertex of the region. For each vertex of the region, the clustering engine can identify the archetype embedding corresponding to the vertex as being an embedding 504 having minimum distance to the vertex, i.e., from among the set of embeddings 504.
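To illustrate the vertex-to-archetype step, the sketch below picks, for each vertex of an enclosing region, the embedding closest to that vertex. For simplicity it uses the corners of an axis-aligned bounding box as the enclosing region rather than a convex hull; that substitution, the function names, and the de-duplication of repeated picks are assumptions of this example, not details from the process of FIG. 3B.

```python
import itertools
import math


def bounding_box_vertices(embeddings):
    """Corners of the axis-aligned box enclosing all embeddings.

    Assumption: a bounding box stands in for the enclosing region here;
    it has 2**d corners, so this is only practical for small d.
    """
    dims = range(len(embeddings[0]))
    lows = [min(e[d] for e in embeddings) for d in dims]
    highs = [max(e[d] for e in embeddings) for d in dims]
    return [list(corner) for corner in itertools.product(*zip(lows, highs))]


def archetype_indices(embeddings, vertices):
    """For each vertex, pick the index of the embedding nearest to it."""
    indices = []
    for v in vertices:
        nearest = min(range(len(embeddings)),
                      key=lambda i: math.dist(embeddings[i], v))
        if nearest not in indices:  # keep each archetype only once
            indices.append(nearest)
    return indices


embeddings = [[0.0, 0.0], [4.0, 0.5], [0.5, 4.0], [4.0, 4.0], [2.0, 2.0]]
vertices = bounding_box_vertices(embeddings)
archetypes = archetype_indices(embeddings, vertices)
```

The interior embedding at (2, 2) is never selected, which matches the intent that archetypes sit at the extremes of the enclosing region.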

Clustering the embeddings 504 with reference to a set of archetype embeddings, e.g., that are identified by the example process described with reference to FIG. 3B, can enable the clustering engine 506 to perform the clustering more efficiently than would otherwise be possible. In particular, clustering the embeddings 504 with reference to a set of archetype embeddings can be performed in a single parallelizable computational step, in contrast to clustering techniques that rely on performing a large number of serial clustering iterations.

The patient clustering system 500 identifies each cluster of embeddings 504 generated by the clustering engine 506 as representing a respective patient category 508. The patient clustering system 500 further identifies each patient in the population of patients as being included in the patient category represented by the cluster that includes the embedding of the multi-modal data characterizing the patient.

The patient clustering system 500 can provide the patient categories 508 for use by a cluster analysis system, as will be described in more detail with reference to FIG. 6, and a patient classification system, as will be described in more detail with reference to FIG. 7A.

FIG. 6 shows an example cluster analysis system 600. The cluster analysis system 600 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The cluster analysis system 600 receives, from the patient clustering system 500 described with reference to FIG. 5, data defining a set of patient categories. Each patient category represents a respective cluster of embeddings in a latent space, where each embedding represents multi-modal data characterizing a patient included in the patient category.

Each patient included in each patient category can be associated with a class from a set of classes. A few examples of possible classes are described next.

In one example, the set of classes can include one class indicating thata patient is classified as having responded to a medical treatment, andanother class indicating that a patient is classified as having notresponded to the medical treatment. The medical treatment can involve,e.g., administering a drug to a patient. A patient can be said to have“responded” to a medical treatment, e.g., if applying the medicaltreatment to the patient caused at least a predefined threshold level ofimprovement in the medical condition of the patient.

As another example, the set of classes can include one class indicatingthat a patient is classified as having experienced significant sideeffects after receiving a medical treatment, and another classindicating that a patient is classified as having not experiencedsignificant side effects after receiving the medical treatment.

As another example, the set of classes can include one class indicatingthat a patient has been diagnosed with a medical condition, and anotherclass indicating that a patient has not been diagnosed with the medicalcondition. The medical condition can be, e.g., a psychiatric condition,e.g., depression or schizophrenia.

FIG. 6 provides an illustration of patient categories 606-A, 606-B, and 606-C that each represent a respective cluster of embeddings in the latent space. In the illustration, each embedding in a first class (“class #1” 602) is represented by an O token, and each embedding in a second class (“class #2” 604) is represented by an X token.

The cluster analysis system 600 generates a respective class distribution corresponding to each patient category. The class distribution for a patient category defines, for each class, a respective fraction of the patients included in the patient category that are associated with the class.

In the example illustrated in FIG. 6, the cluster analysis system 600 generates class distribution 608-A for patient category 606-A, class distribution 608-B for patient category 606-B, and class distribution 608-C for patient category 606-C. It can be appreciated that patients included in patient category 606-A are predominately associated with class #2, patients included in patient category 606-B are predominately associated with class #1, and patients included in patient category 606-C are evenly spread between class #1 and class #2.
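As an illustration of how a class distribution could be computed, the sketch below counts, for each patient category, the fraction of its patients associated with each class. The dictionary-based data layout, the patient identifiers, and the class names are hypothetical.

```python
from collections import Counter, defaultdict


def class_distributions(patient_category, patient_class, classes):
    """Per-category fraction of patients associated with each class."""
    counts = defaultdict(Counter)
    for patient, category in patient_category.items():
        counts[category][patient_class[patient]] += 1
    return {
        category: {c: counter[c] / sum(counter.values()) for c in classes}
        for category, counter in counts.items()
    }


# Hypothetical assignments of five patients to two categories.
patient_category = {"p1": "A", "p2": "A", "p3": "A", "p4": "B", "p5": "B"}
patient_class = {
    "p1": "responder", "p2": "non-responder", "p3": "non-responder",
    "p4": "responder", "p5": "responder",
}
distributions = class_distributions(
    patient_category, patient_class, ["responder", "non-responder"])
```

Here category "A" is mostly non-responders and category "B" is entirely responders, mirroring the kind of contrast drawn between the categories in FIG. 6.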

The class distributions generated by the cluster analysis system 600 can be used in conjunction with a patient classification system as a basis for making inferences about patients and for making clinical decisions related to patient care, as will be described in more detail with reference to FIG. 8.

FIG. 7A shows an example patient classification system 700. The patient classification system 700 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The patient classification system 700 processes input multi-modal data 702 characterizing a patient to generate a patient classification 714 that classifies the patient as being included in a patient category from a set of patient categories 708.

The set of patient categories 708 can be determined by the patient clustering system 500 described with reference to FIG. 5. Each patient category 708 represents a cluster of embeddings in the latent space, where each embedding represents multi-modal data characterizing a respective patient from a population of patients, referred to as “training” patients for convenience, as described above.

The patient classification system 700 generates the patient classification 714 using an encoder neural network 104 (e.g., as described with reference to FIG. 1 and FIG. 2), a scoring engine 706, and a classification engine 712. Prior to being used by the patient classification system 700, the encoder neural network 104 is jointly trained, along with a decoder neural network, e.g., by the training system described with reference to FIG. 9. The scoring engine 706 and the classification engine 712 will be described in more detail below.

The encoder neural network processes the input multi-modal data 702 to generate an embedding 704 of the input multi-modal data 702 in the latent space.

The scoring engine 706 determines, for each patient category 708, a respective classification score 710 for the patient category 708 based on the embedding 704 of the input multi-modal data 702. The scoring engine 706 can determine the classification scores 710 in any of a variety of possible ways. A few example techniques for determining the classification scores 710 are described in more detail next.

In some implementations, to determine the classification score 710 for a patient category 708, the scoring engine 706 determines a “centroid” embedding for the patient category, e.g., by averaging (or otherwise combining) the embeddings included in the cluster of embeddings represented by the patient category. The scoring engine 706 then determines the classification score 710 for the patient category 708 by computing a similarity measure between: (i) the embedding 704 of the input multi-modal data 702, and (ii) the centroid embedding for the patient category 708. The similarity measure can be, e.g., an L₂ similarity measure, a cosine similarity measure, or any other appropriate similarity measure.
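A minimal sketch of centroid-based scoring follows, using cosine similarity as the similarity measure. The function names, the category labels, and the toy embeddings are assumptions for illustration.

```python
import math


def centroid(cluster):
    """Coordinate-wise average of the embeddings in a cluster."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def centroid_scores(embedding, clusters):
    """Classification score per category: similarity to the category centroid."""
    return {name: cosine_similarity(embedding, centroid(members))
            for name, members in clusters.items()}


clusters = {
    "A": [[1.0, 0.0], [1.0, 0.2]],
    "B": [[0.0, 1.0], [0.2, 1.0]],
}
scores = centroid_scores([0.9, 0.1], clusters)
```

The input embedding points in nearly the same direction as the centroid of category "A", so that category receives the higher score.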

In some implementations, to determine the classification score 710 for a patient category 708, the scoring engine 706 fits the parameters of a probability distribution to the embeddings included in the cluster of embeddings represented by the patient category. For example, the probability distribution can be a Normal distribution, the parameters of the probability distribution can be the mean and covariance parameters of the Normal distribution, and the scoring engine 706 can fit the mean and covariance parameters of the Normal distribution using any appropriate fitting technique, for example, a maximum likelihood estimation (MLE) technique. The scoring engine 706 then determines the classification score 710 for the patient category by computing the likelihood of the embedding 704 of the input multi-modal data 702 under the probability distribution.
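The likelihood-based scoring above can be sketched as follows. To keep the example dependency-free, it fits a Normal distribution with a diagonal covariance (one variance per latent dimension) rather than a full covariance matrix, and it works with log-likelihoods; both choices, along with the small variance floor `eps` and the function names, are assumptions of this sketch.

```python
import math


def fit_diagonal_normal(cluster):
    """Maximum likelihood mean and per-dimension variance of a cluster."""
    n = len(cluster)
    columns = list(zip(*cluster))
    means = [sum(col) / n for col in columns]
    variances = [sum((x - m) ** 2 for x in col) / n
                 for col, m in zip(columns, means)]
    return means, variances


def log_likelihood(x, means, variances, eps=1e-6):
    """Log-density of x under the fitted diagonal Normal distribution.

    eps is a hypothetical floor that avoids division by zero when a
    dimension has zero variance within the cluster.
    """
    total = 0.0
    for xi, m, v in zip(x, means, variances):
        v = v + eps
        total += -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
    return total


cluster = [[1.0, 2.0], [3.0, 2.5], [2.0, 1.5]]
means, variances = fit_diagonal_normal(cluster)
near = log_likelihood([2.0, 2.0], means, variances)
far = log_likelihood([8.0, 9.0], means, variances)
```

An embedding close to the cluster mean receives a higher log-likelihood, and therefore a higher classification score, than one far from it.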

In some implementations, the scoring engine 706 generates the classification scores 710 using a classification machine learning model that is configured to receive an embedding from the latent space, and to process the embedding to generate a respective classification score 710 for each patient category 708. The classification score 710 for a patient category can represent a likelihood that the patient is included in the patient category.

The classification machine learning model can be any appropriate machine learning model, e.g., a neural network model, a random forest model, or a support vector machine (SVM) model. For example, the classification machine learning model can be a neural network model that includes any appropriate types of neural network layers (e.g., fully-connected layers, convolutional layers, or attention layers) in any appropriate numbers (e.g., 1 layer, 5 layers, or 10 layers) and connected in any appropriate configuration (e.g., as a linear sequence of layers).

The patient classification system 700 can train the classification machine learning model on a set of training data that includes multiple training examples. Each training example corresponds to a respective training patient and specifies: (i) the embedding, generated by the encoder neural network 104, of the multi-modal data characterizing the training patient, and (ii) a label identifying the patient category 708 of the training patient. As described above, the patient category of a training patient identifies the cluster of embeddings (e.g., as generated by the patient clustering system described with reference to FIG. 5) that includes the embedding of the multi-modal data characterizing the training patient.

The patient classification system 700 can train the classification machine learning model on the training data using any appropriate machine learning training technique. For example, if the classification machine learning model is a neural network model, then the patient classification system 700 can train the neural network model on the training data using a stochastic gradient descent training technique to optimize a cross-entropy objective function (or any other appropriate objective function). Generally, the patient classification system 700 trains the classification machine learning model to, for each training example, increase the classification score generated by the classification machine learning model (i.e., as a result of processing the embedding of the training patient) for the patient category that includes the training patient.

The classification engine 712 classifies the patient as being included in a corresponding patient category 708 based on the classification scores 710. For example, the classification engine 712 can classify the patient as being included in the patient category associated with the highest classification score 710.

Optionally, in combination with or as an alternative to generating the patient classification 714, the patient classification system 700 can provide the classification scores 710 to a set prediction system 718. The set prediction system 718 can process the classification scores 710 to generate a trust set 720 for the patient. The trust set 720 specifies one or more patient categories that collectively form a proper subset of the full set of patient categories, such that the patient is predicted to be included in a patient category within the trust set 720 with at least a threshold probability 716. The threshold probability 716 can be any appropriate probability, e.g., 50%, 75%, 90%, 95%, or 99%.

In contrast to the point estimate defined by the patient classification 714, i.e., which defines a single “best guess” for the patient category of the patient, the trust set 720 can include multiple patient categories. The trust set 720 can thus account for uncertainty in the classification of the patient into a patient category. For example, if a patient classification is more uncertain, the trust set 720 can reflect the uncertainty by including a larger number of patient categories. Uncertainty in patient classification can arise from, e.g., errors and noise in the input multi-modal data, as well as ambiguity inherent in mapping complex, high-dimensional multi-modal data characterizing a patient to a discrete set of patient categories.

The trust set 720 encodes information that is complementary to the patient classification 714, and both the trust set 720 and the patient classification 714 can be used to generate clinical recommendations, as will be described in more detail below with reference to FIG. 8.

FIG. 7B shows an example set prediction system 718. The set prediction system 718 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The set prediction system 718 is configured to process a set of classification scores 710 for a patient to generate a trust set 720 for the patient. The classification scores 710 for the patient include a respective score for each patient category in a set of patient categories, and can be generated, e.g., by the patient classification system 700 described with reference to FIG. 7A, or by any other appropriate system. The trust set 720 specifies a proper subset of the set of patient categories such that the patient is predicted to be included in a patient category within the trust set 720 with at least a threshold probability 716.

The set prediction system 718 generates the trust set 720 using a set of calibration examples 722, a calibration engine 724, a quantile engine 728, and a set prediction engine 732, which are each described in more detail next.

The set of calibration examples 722 is a subset of the set of training examples that are processed by the patient clustering system 500, described with reference to FIG. 5, to generate the patient categories. More specifically, the patient clustering system 500 receives a set of training examples that each correspond to a respective patient and include multi-modal data characterizing the patient. The patient clustering system 500 processes the multi-modal data from each training example to generate a corresponding embedding in a latent space, and then clusters the embeddings in the latent space to identify a set of clusters of embeddings. Each cluster defines a respective patient category, and each patient is defined as being included in the patient category represented by the cluster that includes the embedding of the multi-modal data characterizing the patient. Each calibration example 722 thus includes a set of multi-modal data characterizing a patient and is associated with a “target” patient category, i.e., determined by the patient clustering system 500.

In implementations where the patient classification system 700 generates classification scores 710 using a classification machine learning model (as described with reference to FIG. 7A), the calibration examples 722 are held-out from the training of the classification machine learning model. That is, the calibration examples 722 are not used to train the classification machine learning model.

The calibration engine 724 generates a respective calibration score 726 for each calibration example 722. To generate a calibration score 726 for a calibration example 722, the calibration engine 724 can process the multi-modal data included in the calibration example 722 to generate classification scores for the calibration example 722, e.g., using the patient classification system 700 described with reference to FIG. 7A. The calibration engine 724 can then generate the calibration score 726 for the calibration example 722 based on the classification scores for the calibration example using a scoring function that measures an error between: (i) a set of classification scores, and (ii) a patient category. In particular, the calibration engine 724 can generate the calibration score 726 for a calibration example by using the scoring function to measure an error between: (i) the classification scores for the calibration example, and (ii) the target patient category of the calibration example. A few examples of possible scoring functions are described next.

In one example, the scoring function s(·,·) may be given by:

$$s\left(f(X), y\right) = 1 - f(X)|_{y} \qquad \text{(4)}$$

where $f(X)$ are classification scores, $y$ is a patient category, and $f(X)|_{y}$ is the classification score for patient category $y$.

In another example, the scoring function s(·,·) may be given by:

$$s\left(f(X), y\right) = \sum_{j=1}^{k} f(X)|_{\pi_{j}}, \quad \text{where } y = \pi_{k} \qquad \text{(5)}$$

where $f(X)$ are classification scores, $y$ is a patient category, and $(\pi_{j})_{j=1}^{K}$ is a permutation of {1, ..., K} that sorts the K classification scores from highest to lowest.
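The two scoring functions of equations (4) and (5) can be sketched in Python as follows. The function names are hypothetical, patient categories are represented by zero-based list indices (an assumption of this example), and ties between equal scores are broken by index order.

```python
def score_eq4(scores, y):
    """Equation (4): one minus the classification score of category y."""
    return 1.0 - scores[y]


def score_eq5(scores, y):
    """Equation (5): cumulative score, from highest down to category y."""
    # Permutation that sorts category indices by score, highest first.
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    total = 0.0
    for j in order:
        total += scores[j]
        if j == y:
            return total
    raise ValueError("y is not a valid patient category index")


scores = [0.1, 0.6, 0.3]
error_top = score_eq5(scores, 1)     # target is the top-ranked category
error_second = score_eq5(scores, 2)  # target is ranked second
```

Under both functions, a target category that the classifier ranks highly yields a small error, and a target category that the classifier ranks poorly yields a large one.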

The quantile engine 728 processes the set of calibration scores 726 and the threshold probability 716 to generate a quantile value 730 as a quantile of the set of calibration scores. For example, the quantile engine 728 can generate the quantile value 730 as the α-th quantile of the set of calibration scores, where α is given by:

$$\alpha = \frac{\left\lceil (n + 1) \cdot p \right\rceil}{n} \qquad \text{(6)}$$

where n is the number of calibration examples 722, ⌈·⌉ denotes a ceiling function, and p is the threshold probability 716.
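As a rough sketch of the quantile computation in equation (6), the example below takes the α-th empirical quantile to be the ⌈(n + 1)·p⌉-th smallest calibration score. That reading of the empirical quantile, the function name, and the handling of the case where the rank exceeds n (returning infinity, so that no category is excluded) are assumptions of this example.

```python
import math


def quantile_value(calibration_scores, p):
    """Empirical quantile at level ceil((n + 1) * p) / n of the scores."""
    n = len(calibration_scores)
    rank = math.ceil((n + 1) * p)  # numerator of equation (6)
    if rank > n:
        # Assumption: too few calibration examples for this threshold
        # probability, so fall back to a quantile that excludes nothing.
        return math.inf
    return sorted(calibration_scores)[rank - 1]


calibration_scores = [0.4, 0.05, 0.7, 0.1, 0.3, 0.8, 0.2, 0.6, 0.5]
q = quantile_value(calibration_scores, p=0.8)
```

With nine calibration scores and p = 0.8, the rank is ⌈10 · 0.8⌉ = 8, so the quantile value is the eighth smallest score.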

The set prediction engine 732 processes: (i) the classification scores 710 for the patient, and (ii) the quantile value 730, to generate the trust set 720 for the patient. To determine whether a patient category is included in the trust set 720, the set prediction engine 732 can use the scoring function (as described above with reference to the calibration engine 724) to generate a “test” score for each patient category. More specifically, the set prediction engine 732 can generate a test score for a patient category by processing: (i) the classification scores 710 for the patient, and (ii) data identifying the patient category, using the scoring function. The set prediction engine 732 can then determine that each patient category is included in the trust set 720 if the test score for the patient category does not exceed the quantile value 730.

For example, if the scoring function is provided by equation (4), then the set prediction engine 732 can generate the trust set 720 as:

$$\left\{ y : f(X)|_{y} \geq 1 - q \right\} \qquad \text{(7)}$$

where $y$ denotes patient categories, $f(X)|_{y}$ denotes the classification score for patient category $y$, and $q$ denotes the quantile value 730.

As another example, if the scoring function is provided by equation (5), then the set prediction engine 732 can generate the trust set as:

$$\left\{ \pi_{1}, \ldots, \pi_{k} \right\}, \quad \text{where } k = \inf\left\{ k : \sum_{j=1}^{k} f(X)|_{\pi_{j}} \geq q \right\} \qquad \text{(8)}$$

where $(\pi_{j})_{j=1}^{K}$ is a permutation of {1, ..., K} that sorts the K classification scores from highest to lowest, $\{\pi_{1}, \ldots, \pi_{k}\}$ are the indices of the patient categories included in the trust set 720, $f(X)|_{\pi_{j}}$ is the classification score for the patient category indexed by $\pi_{j}$, inf {·} is the infimum operator, and $q$ is the quantile value 730.
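The trust-set constructions of equations (7) and (8) can be sketched as follows. The function names and the zero-based category indices are assumptions of this example; in the second function, if the cumulative score never reaches q, every category is returned.

```python
def trust_set_eq7(scores, q):
    """Equation (7): categories whose score is at least 1 - q."""
    return {y for y, s in enumerate(scores) if s >= 1.0 - q}


def trust_set_eq8(scores, q):
    """Equation (8): smallest top-ranked prefix whose scores sum to >= q."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    chosen, total = [], 0.0
    for j in order:
        chosen.append(j)
        total += scores[j]
        if total >= q:
            break
    return chosen


confident = [0.05, 0.9, 0.05]
uncertain = [0.1, 0.6, 0.3]
small_set = trust_set_eq8(confident, q=0.75)
large_set = trust_set_eq8(uncertain, q=0.75)
threshold_set = trust_set_eq7(uncertain, q=0.75)
```

For the same quantile value, the more evenly spread scores produce a larger trust set, which is how the set size reflects classification uncertainty.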

Generating the trust set 720 using the procedure described above can result in the patient being included in a patient category within the trust set with at least the threshold probability, irrespective of the accuracy of the patient classification system used to generate the classification scores 710.

The trust set 720 for a patient is adapted to the difficulty and uncertainty of the patient classification. For example, a trust set with a larger number of patient categories can reflect a more uncertain patient classification, while a trust set with a smaller number of patient categories can reflect a more certain classification.

Generally, the set prediction system 718 is not required to re-compute the quantile value 730 associated with a threshold probability 716 each time the set prediction system 718 generates a trust set 720 for a patient. Rather, the set prediction system 718 can compute the quantile value 730 associated with a threshold probability once, and thereafter store and reuse the quantile value 730 each time the set prediction system 718 is called upon to generate a trust set 720 based on the threshold probability 716.

FIG. 8 shows an example recommendation system 800. The recommendation system 800 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The recommendation system 800 uses the patient classification system 700 (as described with reference to FIG. 7A), the cluster analysis system 600 (as described with reference to FIG. 6), the set prediction system 718 (as described with reference to FIG. 7B), and a recommendation engine 816 to generate a clinical recommendation 818 for a patient 802. The clinical recommendation 818 can be, e.g., a recommendation related to medical treatment of the patient.

In particular, the patient classification system 700 processes multi-modal data 804 characterizing the patient 802 to generate a patient classification 714 that classifies the patient as being included in a patient category from a set of patient categories.

The cluster analysis system 600 processes data identifying the patient category of the patient 802 to generate a class distribution 812. The class distribution 812 defines, for each class in a set of classes, a fraction of patients included in the patient category that are associated with the class (as described above with reference to FIG. 6).

Optionally, in combination with or as an alternative to generating the class distribution 812 based on the patient classification 714, the cluster analysis system 600 can generate a “combined” class distribution 812 using the set of classification scores generated by the patient classification system 700. In particular, as part of generating the patient classification 714, the patient classification system 700 generates a set of classification scores that includes a respective classification score for each patient category in the set of patient categories. The cluster analysis system 600 can generate a combined class distribution based on, for each patient category: (i) the classification score for the patient category, and (ii) the class distribution of the patient category.

In particular, the cluster analysis system 600 can generate the combined class distribution as a linear combination of the respective class distribution corresponding to each patient category, where the class distribution corresponding to each patient category is weighted in the linear combination by the corresponding classification score. More specifically, the combined class distribution can define a respective likelihood score for each class in a set of classes based on, for each patient category: (i) the classification score for the patient category, and (ii) the fraction of patients included in the patient category that are associated with the class. For example, the combined class distribution can define a likelihood score $L_{c}$ associated with each class $c$ as:

$$L_{c} = \sum_{p=1}^{P} CS_{p} \cdot F_{p,c} \qquad \text{(9)}$$

where $p$ indexes the patient categories, $P$ is the number of patient categories, $CS_{p}$ is the classification score for patient category $p$, and $F_{p,c}$ is the fraction of patients included in patient category $p$ associated with class $c$.

In some implementations, the cluster analysis system 600 can generate the combined class distribution with reference to only a proper subset of the patient categories in the set of patient categories. For example, the cluster analysis system 600 can generate the combined class distribution with reference only to patient categories included in a trust set 720 generated by the set prediction system 718, as will be described in more detail below.
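Equation (9), together with the optional restriction to a subset of patient categories, can be sketched as follows. The function name, the list-and-dictionary data layout, and the class names are hypothetical; when a subset is given, the sum is simply restricted to it without renormalization, which is an assumption of this example.

```python
def combined_class_distribution(classification_scores, class_fractions,
                                categories=None):
    """Likelihood score L_c = sum over p of CS_p * F_{p,c}, per equation (9).

    classification_scores[p] is CS_p and class_fractions[p][c] is F_{p,c}.
    categories optionally restricts the sum to a subset of category
    indices, e.g., the categories in a trust set.
    """
    if categories is None:
        categories = range(len(classification_scores))
    categories = list(categories)
    classes = class_fractions[0].keys()
    return {
        c: sum(classification_scores[p] * class_fractions[p][c]
               for p in categories)
        for c in classes
    }


scores = [0.5, 0.25, 0.25]
fractions = [
    {"responder": 1.0, "non-responder": 0.0},
    {"responder": 0.5, "non-responder": 0.5},
    {"responder": 0.0, "non-responder": 1.0},
]
combined = combined_class_distribution(scores, fractions)
restricted = combined_class_distribution(scores, fractions, categories=[0, 1])
```

When the classification scores sum to one and every category is included, the resulting likelihood scores also sum to one.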

The recommendation system 800 generates one or more predictions 814 characterizing the patient 802 based on the class distribution 812. A few examples of possible predictions 814 are described next.

In one example, the set of classes can include one class for patients that are classified as having responded to a medical treatment, and another class for patients that are classified as having not responded to the medical treatment. In this example, the recommendation system 800 can process the class distribution 812 for the patient category to generate a prediction 814 for a likelihood that the patient 802 will respond to the medical treatment. For example, the recommendation system 800 can determine the likelihood that the patient 802 will respond to the medical treatment as being the fraction of patients included in the patient category that responded to the medical treatment. As another example, if the class distribution is a combined class distribution (as described above), the recommendation system 800 can determine the likelihood that the patient 802 will respond to the medical treatment as being the likelihood assigned to the corresponding class by the combined class distribution.

In another example, the set of classes can include one class for patients that have been classified as having experienced significant side effects from receiving a medical treatment, and another class for patients that are classified as having not experienced significant side effects from receiving the medical treatment. In this example, the recommendation system 800 can process the class distribution 812 for the patient category to generate a prediction 814 for a likelihood that the patient 802 will experience significant side effects from receiving the medical treatment. For example, the recommendation system can determine the likelihood that the patient 802 will experience significant side effects from receiving the medical treatment as being the fraction of patients included in the patient category that experienced significant side effects from receiving the medical treatment. As another example, if the class distribution is a combined class distribution (as described above), the recommendation system 800 can determine the likelihood that the patient 802 will experience significant side effects as being the likelihood assigned to the corresponding class by the combined class distribution.

In another example, the set of classes can include one class for patients that have been diagnosed with a medical condition, and a second class for patients that have not been diagnosed with the medical condition. In this example, the recommendation system 800 can process the class distribution 812 for the patient category to generate a prediction 814 for a likelihood of the patient 802 having the medical condition. For example, the recommendation system 800 can determine the likelihood that the patient has the medical condition as being the fraction of patients included in the patient category that have been diagnosed with the medical condition. As another example, if the class distribution is a combined class distribution (as described above), the recommendation system 800 can determine the likelihood that the patient 802 has the medical condition as being the likelihood assigned to the corresponding class by the combined class distribution.

In addition to generating the prediction 814 characterizing the patient 802, the recommendation system 800 can use the set prediction system 718 to generate a trust set 720 for the patient 802. More specifically, the set prediction system 718 can process a set of classification scores generated by the patient classification system 700, i.e., including a respective classification score for each patient category in the set of patient categories, to generate the trust set 720. The trust set 720 specifies a proper subset of the set of patient categories such that the patient is predicted to be included in a patient category within the trust set 720 with at least a threshold probability.

The recommendation system 800 can process the trust set 720 to derive anuncertainty measure 810, i.e., a numerical value that measures anuncertainty in the patient classification 714 generated by the patientclassification system 700. For example, the uncertainty measure 810 canrepresent a number of patient categories that are included in the trustset 720. Generally, a larger number of patient categories being includedin the trust set 720 indicates a higher uncertainty in the patientclassification 714.

The recommendation engine 816 can generate the clinical recommendation818 based on: (i) the prediction 814 characterizing the patient 802, and(ii) the uncertainty measure 810 characterizing uncertainty in thepatient classification 714.

More specifically, to generate the clinical recommendation 818, therecommendation engine 816 can evaluate a confidence criterion based atleast in part on the uncertainty measure 810. In response to determiningthat the confidence criterion is satisfied, the recommendation engine816 can map the prediction 814 characterizing the patient onto acorresponding clinical decision, and generate a clinical recommendation818 that includes the clinical decision. (Examples of mapping aprediction 814 onto a corresponding clinical decision are described inmore detail below). In response to determining that the confidencecriterion is not satisfied, the recommendation engine 816 can generate a“null” clinical recommendation 818, e.g., indicating that therecommendation system 800 lacks a required level of confidence togenerate a clinical recommendation 818.

The recommendation engine 816 can evaluate whether the confidencecriterion is satisfied based on the uncertainty measure 810, andoptionally, based on other factors as well, e.g., a number of patientsincluded in the patient category. Generally, a larger number of patientsbeing included in a patient category can decrease the uncertainty of aprediction 814 generated based on the inclusion of a patient in thepatient category. For example, a larger number of patients beingincluded in a patient category can decrease the effect of statisticalfluctuations on the class distribution of the patient category.

A few example techniques by which the recommendation engine 816 can evaluate whether the confidence criterion is satisfied are described next.

In one example, the recommendation engine 816 can determine that the confidence criterion is satisfied if the uncertainty measure 810 satisfies a threshold. For example, the recommendation engine 816 can determine that the confidence criterion is satisfied if the uncertainty measure is less than N, where N can be, e.g., 2, 3, 5, or any other appropriate positive integer value.

In another example, the recommendation engine 816 can determine that the confidence criterion is satisfied only if both: (i) the uncertainty measure satisfies an uncertainty threshold, and (ii) the number of patients included in the patient category satisfies (e.g., exceeds) a threshold. The threshold number of patients can be any appropriate number of patients, e.g., 10, 100, or 1000 patients.
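The two example criteria above can be sketched as follows. This is a minimal illustration only, assuming the uncertainty measure is the number of patient categories in the trust set; the function names, parameter names, and default values are hypothetical and do not appear in this specification.

```python
from typing import Optional, Sequence


def uncertainty_measure(trust_set: Sequence[str]) -> int:
    """Return the number of patient categories in the trust set."""
    return len(trust_set)


def confidence_criterion_satisfied(
    trust_set: Sequence[str],
    max_uncertainty: int = 3,
    category_size: Optional[int] = None,
    min_category_size: Optional[int] = None,
) -> bool:
    """Evaluate a hypothetical confidence criterion.

    The uncertainty measure must be less than `max_uncertainty` (the value N
    in the text). If `min_category_size` is given, the number of patients in
    the patient category must also exceed it.
    """
    if uncertainty_measure(trust_set) >= max_uncertainty:
        return False
    if min_category_size is not None:
        # Assumption: an unknown category size fails the optional size check.
        if category_size is None or category_size <= min_category_size:
            return False
    return True
```

Leaving `min_category_size` unset corresponds to the first example (uncertainty threshold only); setting it corresponds to the second example (both conditions).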

If the recommendation engine 816 determines that the confidence criterion is satisfied, then the recommendation engine 816 can map the prediction 814 characterizing the patient onto a corresponding clinical decision and generate a clinical recommendation 818 that includes the clinical decision, as described above. A few examples of clinical decisions corresponding to predictions 814 are described next.

In one example, a prediction 814 that the patient will respond to a medical treatment with at least a threshold likelihood (e.g., 75%, 90%, 95%, or any other appropriate threshold likelihood) can be mapped onto a clinical decision to apply the medical treatment to the patient. Conversely, a prediction 814 that the patient will respond to a medical treatment with less than the threshold likelihood can be mapped onto a clinical decision not to apply the medical treatment to the patient. (Applying the medical treatment to the patient can include, e.g., administering a drug to the patient.) In some cases, the clinical decision can be implemented in practice, e.g., by applying the medical treatment to the patient.

In another example, a prediction 814 that the patient will experience significant side effects from a medical treatment with at least a threshold likelihood (e.g., 75%, 90%, 95%, or any other appropriate threshold likelihood) can be mapped onto a clinical decision not to apply the medical treatment to the patient. Conversely, a prediction 814 that the patient will experience significant side effects from the medical treatment with less than the threshold likelihood can be mapped onto a clinical decision to apply the medical treatment to the patient. In some cases, the clinical decision can be implemented in practice, e.g., by applying the medical treatment to the patient.

In another example, a prediction 814 that the patient has a medical condition with at least a threshold likelihood (e.g., 75%, 90%, 95%, or any other appropriate threshold likelihood) can be mapped onto a clinical decision to diagnose the patient with the medical condition. Conversely, a prediction 814 that the patient has the medical condition with less than the threshold likelihood can be mapped onto a clinical decision not to diagnose the patient with the medical condition.
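The threshold-based mappings in the preceding examples share one pattern, sketched below. The function names, decision strings, and the use of `None` to stand for a “null” clinical recommendation are illustrative assumptions rather than part of the described system.

```python
from typing import Optional


def map_prediction_to_decision(
    likelihood: float,
    threshold: float,
    decision_if_met: str,
    decision_otherwise: str,
) -> str:
    """Map a predicted likelihood onto one of two clinical decisions."""
    return decision_if_met if likelihood >= threshold else decision_otherwise


def treatment_recommendation(
    response_likelihood: float,
    criterion_satisfied: bool,
    threshold: float = 0.9,
) -> Optional[str]:
    """Return a clinical decision, or None (a "null" recommendation)."""
    if not criterion_satisfied:
        # Confidence criterion not satisfied: no decision is recommended.
        return None
    return map_prediction_to_decision(
        response_likelihood,
        threshold,
        decision_if_met="apply treatment",
        decision_otherwise="do not apply treatment",
    )
```

For the side-effect example, the same helper can be called with the two decision strings swapped; for the diagnosis example, with "diagnose" and "do not diagnose" as the decisions.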

After generating the clinical recommendation 818, the recommendation system 800 can output the clinical recommendation 818, e.g., by providing the clinical recommendation 818 to a user of the system, e.g., by way of a user interface made available to the user.

As described above, the recommendation system 800 can process multi-modal data 804 for patients 802 to generate corresponding predictions 814 and/or clinical recommendations 818, in relation to whether a patient will respond to a medical treatment, will experience significant side effects from a medical treatment, or should be diagnosed with a medical condition. In certain cases, particular combinations of data modalities may be especially effective for generating predictions and clinical recommendations for certain medical conditions. A few examples of possible combinations of data modalities that can be processed by the recommendation system to generate predictions and/or clinical recommendations for certain medical conditions are described next. It will be appreciated that these examples are provided for illustrative purposes only and do not limit the potential use cases or applications of the techniques described in this specification.

In some implementations, the recommendation system 800 processes gene expression data, or clinical scale data (characterizing ALS severity, or respiratory function, or both), or both to generate a prediction for whether a patient with ALS will respond to a medical treatment, or a prediction for whether a patient with ALS will experience significant side effects from a medical treatment.

In some implementations, the recommendation system 800 processes a combination of one or more of: clinical scale data (e.g., obtained from clinical interviews with the patient), EEG data, gene expression data, or neuroimaging data (e.g., fMRI data, or PET data, or both), to generate a prediction for whether a patient with schizophrenia will respond to a medical treatment, or a prediction for whether a patient with schizophrenia will experience significant side effects from a medical treatment, or a prediction for whether a patient has schizophrenia.

In some implementations, the recommendation system 800 processes a combination of one or more of: clinical scale data (e.g., obtained from clinical interviews with the patient), gene expression data, neuroimaging data (e.g., fMRI data, or PET data, or both), or protein expression data, to generate a prediction for whether a patient with Parkinson’s disease will respond to a medical treatment, or a prediction for whether a patient with Parkinson’s disease will experience significant side effects from a medical treatment, or a prediction for whether a patient has Parkinson’s disease.

In some implementations, the recommendation system 800 processes a combination of one or more of: MRI data, EEG data, or clinical scale data (e.g., obtained from clinical interviews with the patient), to generate a prediction for whether a patient with major depressive disorder (MDD) will respond to a medical treatment, or a prediction for whether a patient with MDD will experience significant side effects from a medical treatment, or a prediction for whether a patient has MDD.

FIG. 9 shows an example training system 900. The training system 900 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The training system 900 jointly trains the encoder neural network 104 and the decoder neural network 108 (as described with reference to FIG. 1 and FIG. 2) on a set of training data 902 that includes multiple training examples. Each training example corresponds to a respective patient and includes multi-modal data characterizing the patient.

The encoder neural network 104 is configured to process input multi-modal data characterizing a patient to generate an embedding of the input multi-modal data in a latent space. For example, the encoder neural network can process the input multi-modal data to generate parameters of a posterior probability distribution over the latent space, e.g., mean and covariance parameters of a Normal distribution over the latent space. The encoder neural network 104 can then sample an embedding from the latent space in accordance with the posterior probability distribution over the latent space.
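The sampling step can be illustrated with a short sketch. It assumes, for simplicity, a Normal posterior with a diagonal covariance, so that each latent dimension has its own mean and standard deviation; the specification does not restrict the covariance to this form, and the function name is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def sample_embedding(
    mean: Sequence[float],
    std: Sequence[float],
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Sample an embedding from a diagonal-covariance Normal posterior.

    `mean` and `std` stand in for the posterior parameters that an encoder
    would output; each coordinate is drawn as mean + std * standard normal.
    """
    if rng is None:
        rng = random.Random()
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
```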

The decoder neural network 108 is configured to process an embedding from the latent space to generate output multi-modal data.

To jointly train the encoder neural network 104 and the decoder neural network 108 on the training data 902, the training system 900 samples a batch (i.e., set) of training examples from the training data 902. The training system 900 then jointly trains the encoder neural network 104 and the decoder neural network 108 on each training example from the batch.

To jointly train the encoder neural network 104 and the decoder neural network 108 on a training example from the batch, the training system 900 processes the input multi-modal data 904 from the training example using the encoder neural network 104, in accordance with values of a set of encoder neural network parameters, to generate an embedding 906 of the input multi-modal data 904. The training system 900 then processes the embedding 906 of the input multi-modal data 904 using the decoder neural network 108, in accordance with values of a set of decoder neural network parameters, to generate “reconstructed” multi-modal data 908 that defines a reconstruction (i.e., an estimate) of the input multi-modal data 904 from the training example.

A training engine 910 then determines gradients 918 of an objective function 912 that depends on the reconstructed multi-modal data 908, and uses the gradients 918 to update the current parameter values of the encoder neural network 104 and the decoder neural network 108. The training engine 910 can determine gradients of the objective function 912 with respect to the current parameter values of the encoder neural network 104 and the decoder neural network 108, e.g., using backpropagation. The training engine 910 can update the current parameter values of the encoder neural network 104 and the decoder neural network 108 using any appropriate gradient descent optimization technique, e.g., RMSprop or Adam.

The objective function 912 includes a reconstruction loss 1000, and optionally, one or more of: an archetype loss 1100, a clustering loss 914, or a prior loss 916, which are each described in more detail below. For example, the objective function L can be given by:

$$L = \alpha_{1} \cdot L_{r} + \alpha_{2} \cdot L_{a} + \alpha_{3} \cdot L_{c} + \alpha_{4} \cdot L_{p} \qquad (10)$$

where $(\alpha_{i})_{i=1}^{4}$ are scalar coefficients, $L_{r}$ denotes the reconstruction loss 1000, $L_{a}$ denotes the archetype loss 1100, $L_{c}$ denotes the clustering loss 914, and $L_{p}$ denotes the prior loss 916. In some cases, one or more of the scalar coefficients $(\alpha_{i})_{i=1}^{4}$ have value zero at one or more training iterations, thereby removing the corresponding terms from the objective function at those training iterations.
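As a small illustration of equation (10), the sketch below combines already-computed loss values with scalar coefficients. The dictionary keys and the function name are hypothetical labels chosen for this example, and the individual loss values are assumed to be supplied by the caller.

```python
from typing import Dict


def objective(losses: Dict[str, float], coefficients: Dict[str, float]) -> float:
    """Weighted sum of loss terms, as in equation (10).

    A coefficient of zero removes the corresponding term from the objective.
    """
    return sum(coefficients[name] * value for name, value in losses.items())


# Example: the clustering term is switched off at this training iteration.
example_losses = {
    "reconstruction": 2.0, "archetype": 1.0, "clustering": 4.0, "prior": 0.5,
}
example_coefficients = {
    "reconstruction": 1.0, "archetype": 0.5, "clustering": 0.0, "prior": 2.0,
}
example_value = objective(example_losses, example_coefficients)
```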

The reconstruction loss 1000 measures an error in the reconstructed multi-modal data 908, i.e., the reconstruction loss 1000 measures an error between: (i) the input multi-modal data 904 from the training example, and (ii) the reconstructed multi-modal data 908 generated by the decoder neural network 108. Training the encoder neural network 104 and the decoder neural network 108 using the reconstruction loss 1000 encourages the encoder neural network 104 to generate embeddings of multi-modal data that encode information characterizing properties of the multi-modal data that enable accurate reconstruction of the multi-modal data from the embeddings.

The reconstruction loss 1000 can include multiple scaling factors that each scale a respective term in the reconstruction loss 1000 that measures an error in a corresponding proper subset of the feature dimensions of the reconstructed multi-modal data 908. (As described above, the features representing multi-modal data, including reconstructed multi-modal data 908, can be understood as being organized into a set of feature dimensions.) Thus each scaling factor controls the relative importance of the error in a corresponding proper subset of the feature dimensions of the reconstructed multi-modal data 908 to the calculation of the overall error in the reconstructed multi-modal data 908.

As an example, the reconstruction loss $L_{r}$ can have the form:

$$L_{r} = \sum_{i=1}^{n} \beta_{i} \cdot L_{r}(A_{i}) \qquad (11)$$

where for each $i \in \{1, ..., n\}$: $A_{i}$ designates a respective proper subset of the feature dimensions of the multi-modal data, $L_{r}(A_{i})$ denotes an error in the proper subset $A_{i}$ of the feature dimensions in the reconstructed multi-modal data 908, and $\beta_{i}$ is a scaling factor corresponding to the proper subset $A_{i}$ of the feature dimensions of the multi-modal data. Generally, the scaling factors $(\beta_{i})_{i=1}^{n}$ have different values. The error $L_{r}(A_{i})$ in a proper subset $A_{i}$ of the feature dimensions in the reconstructed multi-modal data measures an error between: (i) the proper subset $A_{i}$ of the feature dimensions in the input multi-modal data 904 from the training example, and (ii) the proper subset $A_{i}$ of the feature dimensions in the reconstructed multi-modal data 908 generated by the decoder neural network 108. The error can be measured, e.g., using an L₁ similarity measure, an L₂ similarity measure, a cosine similarity measure, or any other appropriate measure.
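A minimal sketch of a scaled reconstruction loss of the form of equation (11) follows. It assumes a squared L₂ error within each subset of feature dimensions; the function names and the representation of each subset as a list of indices are illustrative choices, not part of the specification.

```python
from typing import List, Sequence


def subset_error(x: Sequence[float], x_hat: Sequence[float], dims: Sequence[int]) -> float:
    """Squared L2 error restricted to the feature dimensions in `dims`."""
    return sum((x[d] - x_hat[d]) ** 2 for d in dims)


def reconstruction_loss(
    x: Sequence[float],
    x_hat: Sequence[float],
    subsets: List[Sequence[int]],
    betas: Sequence[float],
) -> float:
    """Sum over subsets A_i of beta_i times the error within A_i."""
    return sum(
        beta * subset_error(x, x_hat, dims)
        for dims, beta in zip(subsets, betas)
    )


# Two subsets of feature dimensions with different scaling factors.
example_loss = reconstruction_loss(
    x=[1.0, 2.0, 3.0, 4.0],
    x_hat=[1.0, 1.0, 3.0, 2.0],
    subsets=[[0, 1], [2, 3]],
    betas=[2.0, 0.5],
)
```

In this example the first subset contributes 2.0 × 1.0 and the second contributes 0.5 × 4.0, so an equal-sized error in the first subset is penalized four times as heavily as in the second.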

The value of each scaling factor in the reconstruction loss 1000 can be set based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data to a particular medical condition. In particular, scaling factors corresponding to proper subsets of the feature dimensions of the multi-modal data that are more relevant to the medical condition can be set to higher values than scaling factors corresponding to less relevant proper subsets of the feature dimensions of the multi-modal data.

The values of the scaling factors in the reconstruction loss 1000 can be determined with reference to any appropriate medical condition, e.g., a psychiatric condition, e.g., depression or schizophrenia.

A few example techniques for determining the values of the scaling factors in the reconstruction loss 1000 are described in more detail next.

In some implementations, the relevance of a feature dimension of the multi-modal data to a medical condition is based on a relevance of the feature dimension to a treatment for the medical condition. A few example techniques for determining the relevance of feature dimensions of the multi-modal data to a treatment for a medical condition are described next.

In one example, the training system 900 determines the relevance of certain feature dimensions of the multi-modal data to a treatment for a medical condition using a “pre/post dataset.” The pre/post dataset includes, for each of one or more “reference” patients, a respective “pre-” value and a respective “post-” value for each feature in a set of “reference” features characterizing the reference patient. The pre-value of each reference feature is measured prior to the reference patient receiving the medical treatment. The post-value of each reference feature is measured after the reference patient receives the medical treatment.

The pre/post dataset can include pre- and post-values for any appropriate reference features, in particular, for reference features of any appropriate modality. In one example, the reference features in the pre/post dataset can include fMRI features, e.g., features representing a functional connectivity matrix, features representing a projection of a functional connectivity matrix onto a vector, or features representing graph statistics characterizing a graph derived from a functional connectivity matrix, as described above. In another example, the reference features in the pre/post dataset can include EEG features, e.g., features representing a Fourier transform of an EEG voltage waveform. In another example, the reference features can be clinical scale features, e.g., characterizing patient mood and personality.

The training system 900 can determine a respective scaling factor for each reference feature that characterizes a relevance of the reference feature to the medical condition based on, for each reference patient, a difference between the pre- and post-values of the reference feature for the reference patient. For example, the training system 900 can determine a respective scaling factor in the reconstruction loss for each feature dimension that corresponds to a reference feature based on a measure of central tendency (e.g., average or median) of the difference between the pre- and post-values of the reference feature for the reference patients. As part of determining the scaling factor in the reconstruction loss for a reference feature, the training system 900 can apply one or more transformation operations to the measure of central tendency of the difference between the pre- and post-values of the reference feature for the reference patients. For example, the training system 900 can apply an absolute value transformation to the measure of central tendency of the difference between the pre- and post-values of the reference feature, e.g., to ensure that the resulting scaling factor is non-negative.
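One way to read the pre/post computation above is sketched below, using the mean as the measure of central tendency and an absolute value as the transformation. The function name and the list-of-lists data layout (one row per reference patient, one column per reference feature) are assumptions made for illustration.

```python
from statistics import mean
from typing import List, Sequence


def prepost_scaling_factors(
    pre: Sequence[Sequence[float]],
    post: Sequence[Sequence[float]],
) -> List[float]:
    """Scaling factor per reference feature from a pre/post dataset.

    For each feature, average the (post - pre) difference over reference
    patients, then take the absolute value so the factor is non-negative.
    """
    n_features = len(pre[0])
    return [
        abs(mean(post_p[f] - pre_p[f] for pre_p, post_p in zip(pre, post)))
        for f in range(n_features)
    ]


# Two reference patients, two reference features.
example_factors = prepost_scaling_factors(
    pre=[[1.0, 2.0], [3.0, 2.0]],
    post=[[0.0, 2.5], [1.0, 2.5]],
)
```

Here the first feature changes by an average of -1.5 and the second by +0.5, so the first feature receives the larger scaling factor.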

A change in the value of a reference feature after a medical treatment is applied to a patient can, in some cases, be at least partially attributed to the effect of the medical treatment. For example, applying a drug to a patient to treat a psychiatric condition (e.g., psychosis or schizophrenia) may affect patterns of neural activity in the brain of the patient, and these changes may be reflected in fMRI features. In particular, fMRI features that change significantly as a result of applying the drug to the patient may measure properties of the brain of the patient that are affected by the application of the drug. Thus the training system 900 can use the pre/post dataset to determine scaling factors for reference fMRI features (or, more generally, for any appropriate features) that reflect the relevance of those features to the drug.

The pre/post dataset can measure values of reference features for any appropriate set of reference patients. The number of reference patients included in the pre/post dataset can be, e.g., one patient, 10 patients, 1000 patients, or any other appropriate number of patients. The set of reference patients can be non-overlapping or only partially overlapping with the set of “training” patients that provide multi-modal data included in the training data 902 used for training the encoder neural network 104 and the decoder neural network 108. Determining scaling factors in the reconstruction loss using the pre/post dataset thus provides a way for the training system 900 to incorporate relevant information encoded in the pre- and post-feature value measurements for the reference patients into the training by way of the reconstruction loss.

In another example, the training system 900 determines the relevance of certain feature dimensions of the multi-modal data to a drug that can be applied to treat a medical condition using positron emission tomography (PET) imaging. In particular, prior to administering the drug to a reference patient, the drug can be labeled (i.e., tagged) with a radioactive tracer element (e.g., technetium-99m). After the drug is administered, one or more PET images of the reference patient can be captured, where the intensity of a voxel in a PET image can be correlated with the presence of the radioactive tracer (and, by extension, the drug) at a corresponding location in the reference patient. In some cases, the PET images can show the brain of the reference patient, and thus characterize the spatial distribution of the drug in the brain of the reference patient.

The training system 900 can process the PET images to determine scaling factors in the reconstruction loss. For example, the training system 900 can process the PET images to generate a “penetration score” for each brain region in a set of brain regions that collectively define a parcellation of the brain of the reference patient. The penetration score for a brain region characterizes the concentration of the drug in the brain region. The training system 900 can generate the penetration score for a brain region, e.g., by computing a measure of central tendency (e.g., an average or median) of the intensities of the voxels included in the brain region in the PET images.

The penetration score (i.e., as determined from PET images) for a brain region can provide a scaling factor in the reconstruction loss for feature dimensions (i.e., in multi-modal data for a patient) that characterize the brain region. A few examples of using penetration scores as scaling factors in the reconstruction loss are described next.

For example, the training system 900 can determine the scaling factor in the reconstruction loss for a feature dimension that represents entry (i,j) in a functional connectivity matrix (i.e., representing the correlation between blood flow curves for brain region i and brain region j in a parcellation) as a product of: (i) the penetration score for brain region i, and (ii) the penetration score for brain region j. As another example, the training system 900 can determine the scaling factor for a multi-modal feature dimension that represents a sum of the entries in row i or column i in a functional connectivity matrix as the penetration score for brain region i. Thus, in these examples, the training system 900 uses PET imaging to determine scaling factors in the reconstruction loss for feature dimensions corresponding to fMRI features.

As another example, the training system 900 can determine the scaling factor in the reconstruction loss for a feature dimension that represents water diffusion in a brain region (e.g., as measured from diffusion tensor imaging (DTI)) as the penetration score for the brain region.
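The penetration-score examples above can be sketched as follows. The sketch assumes that PET voxel intensities are given as a flat list with a parallel list of brain-region indices, that every region contains at least one voxel, and that the mean is used as the measure of central tendency; all names are hypothetical.

```python
from statistics import mean
from typing import List, Sequence


def penetration_scores(
    intensities: Sequence[float],
    region_ids: Sequence[int],
    n_regions: int,
) -> List[float]:
    """Mean PET voxel intensity per brain region of a parcellation."""
    return [
        mean(v for v, r in zip(intensities, region_ids) if r == region)
        for region in range(n_regions)
    ]


def connectivity_scaling_factors(scores: Sequence[float]) -> List[List[float]]:
    """Scaling factor for entry (i, j) of a functional connectivity matrix:
    the product of the penetration scores of regions i and j."""
    return [[s_i * s_j for s_j in scores] for s_i in scores]


# Four voxels split across two brain regions.
example_scores = penetration_scores([2.0, 4.0, 1.0, 3.0], [0, 0, 1, 1], 2)
example_matrix = connectivity_scaling_factors(example_scores)
```

For feature dimensions that describe a single region (a row or column sum of the connectivity matrix, or a DTI-derived diffusion feature), the region's penetration score from `penetration_scores` can be used directly as the scaling factor.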

In some implementations, the training system 900 determines the relevance of certain feature dimensions of the multi-modal data to a medical condition based on a correlation between: (i) the value of the feature dimension, and (ii) diagnosis with the medical condition, for a set of reference patients. In particular, the training system 900 can set the value of the scaling factor for each corresponding feature dimension in the reconstruction loss based on the correlation between the value of the feature dimension and diagnosis with the medical condition in the reference patients. A few examples of determining scaling factors in the reconstruction loss in this manner are described next.

In one example, each reference patient may be associated with: (i) genomic data that defines a respective expression level (i.e., in the reference patient) of each gene in a set of genes, and (ii) a label indicating whether the reference patient has been diagnosed with the medical condition. The training system 900 can determine, for each gene, a correlation between the expression level of the gene and diagnosis with the medical condition in the reference patients. The training system 900 can then set the value of the scaling factor for each multi-modal feature dimension that measures the expression level of a gene based on the determined correlation between the expression level of the gene and diagnosis with the medical condition in the reference patients.

In another example, each reference patient may be associated with: (i) proteomic data that defines a respective expression level (i.e., in the reference patient) of each protein in a set of proteins, and (ii) a label indicating whether the reference patient has been diagnosed with the medical condition. The training system 900 can determine, for each protein, a correlation between the expression level of the protein and diagnosis with the medical condition in the reference patients. The training system 900 can then set the value of the scaling factor for each multi-modal feature dimension that measures the expression level of a protein based on the determined correlation between the expression level of the protein and diagnosis with the medical condition in the reference patients.
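The correlation-based approach in the gene and protein examples can be sketched as below. The specification states only that the scaling factor is "based on" the correlation; using the absolute value of the Pearson correlation, and assigning zero when a feature or the labels are constant, are assumptions made for this illustration, as are the function names.

```python
from math import sqrt
from typing import List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def correlation_scaling_factors(
    expression: Sequence[Sequence[float]],
    diagnosed: Sequence[int],
) -> List[float]:
    """One scaling factor per gene (or protein).

    `expression[p][g]` is the expression level of gene g in reference
    patient p; `diagnosed[p]` is 1 if patient p has the condition, else 0.
    Assumption: the factor is the absolute Pearson correlation.
    """
    n_genes = len(expression[0])
    labels = [float(d) for d in diagnosed]
    return [
        abs(pearson([row[g] for row in expression], labels))
        for g in range(n_genes)
    ]


example_factors = correlation_scaling_factors(
    expression=[[1.0, 5.0], [2.0, 5.0], [3.0, 5.0], [4.0, 5.0]],
    diagnosed=[0, 0, 1, 1],
)
```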

The scaling factors in the reconstruction loss 1000 can be associated with any appropriate proper subsets of the feature dimensions of the multi-modal data that jointly form a partition of the feature dimensions of the multi-modal data. In some implementations, each modality in the multi-modal data is associated with a respective scaling factor in the reconstruction loss, i.e., such that each feature dimension of the multi-modal data corresponding to the same modality is associated with the same scaling factor. In other implementations, feature dimensions of the multi-modal data corresponding to the same modality can be associated with different scaling factors.

FIG. 10 illustrates an example of a reconstruction loss 1000 for a set of multi-modal data 1002. In this example, the multi-modal data 1002 includes respective feature dimensions corresponding to modalities 1004-A - 1004-D. The relevance of respective subsets of the multi-modal data 1002 to a particular medical condition is illustrated by the shade, where darker shades indicate higher relevance (as illustrated by the color bar 1010). The relevance of feature dimensions of multi-modal data to a medical condition can be determined, e.g., based on their relevance to diagnosing the medical condition, as described above.

FIG. 10 further illustrates the values of scaling factors in the reconstruction loss 1000 corresponding to each feature dimension of the multi-modal data 1002. For example, the proper subset 1006-A of the feature dimensions of the multi-modal data is associated with scaling factor value 1008-A, the proper subset 1006-B of the feature dimensions of the multi-modal data is associated with scaling factor value 1008-B, and the proper subset 1006-C of the feature dimensions of the multi-modal data is associated with scaling factor value 1008-C. It can be appreciated that feature dimensions of the multi-modal data that are more relevant to the medical condition are associated with higher scaling factor values in the reconstruction loss 1000.

The scaling factors in the reconstruction loss encourage the multi-modal data embeddings generated by the encoder neural network to efficiently represent information relevant to the medical condition, thus increasing the relevance of the embeddings to the medical condition. The scaling factors in the reconstruction loss can thus enable the patient clustering system (described with reference to FIG. 5) to determine patient categories that are more relevant to the medical condition. The scaling factors in the reconstruction loss can thereby increase the utility and accuracy of predictions and clinical decisions based on classifying patients into patient categories (as described with reference to FIG. 8).

In some implementations, rather than using the reconstruction loss 1000 described above, the training system 900 can implement a reconstruction loss 1000 without scaling factors, e.g., a reconstruction loss that measures reconstruction errors uniformly across each feature dimension of the multi-modal data. For example, the training system 900 can use a reconstruction loss 1000 that measures the error between: (i) reconstructed multi-modal data, and (ii) input multi-modal data, using an L₁ similarity measure, an L₂ similarity measure, or any other appropriate similarity measure.

The archetype loss 1100 is defined with reference to one or more “target” multi-modal data archetypes. Each target multi-modal data archetype represents multi-modal data of the same form (i.e., having the same feature dimensions) as that provided as an input to the encoder neural network 104 and generated as an output by the decoder neural network 108. Each target multi-modal data archetype is associated with a corresponding dimension of the latent space and represents a target (i.e., desired) output to be generated by the decoder neural network 108 by processing an embedding representing the dimension of the latent space.

Generally, each dimension of the latent space can be represented by a basis embedding from a set of basis embeddings in the latent space that provide a basis of the latent space, as described above with reference to FIGS. 3. For instance, the set of basis embeddings can be given by the set of unit embeddings in the latent space, where a unit embedding refers to an embedding where one position in the embedding has value 1 and the other positions in the embedding have value 0.

For convenience, a dimension of the latent space that is associated with a corresponding target multi-modal data archetype is sometimes referred to as an “anchored” dimension of the latent space. Generally, only a proper subset of the dimensions of the latent space is anchored.

To evaluate the archetype loss 1100, the training engine 910 generates, for each anchored dimension of the latent space, a “predicted” multi-modal data archetype for the anchored dimension by processing an embedding representing the anchored dimension using the decoder neural network 108. The archetype loss 1100 then measures, for each anchored dimension of the latent space, an error between: (i) the target multi-modal data archetype for the anchored dimension, and (ii) the predicted multi-modal data archetype for the anchored dimension.

Optionally, to evaluate the archetype loss 1100, the training engine 910 can further process the respective target multi-modal data archetype for each anchored dimension of the latent space, using the encoder neural network 104, to generate an embedding of the target multi-modal data archetype. The archetype loss then also measures, for each anchored dimension of the latent space, an error between: (i) a basis embedding (e.g., unit embedding) representing the anchored dimension of the latent space, and (ii) the embedding of the target multi-modal data archetype for the anchored dimension.

For example, the archetype loss $L_{a}$ can have the form:

$$L_{a} = \sum_{d=1}^{D} \left( \alpha_{d} \cdot L_{a}^{D}(d) + \beta_{d} \cdot L_{a}^{E}(d) \right) \qquad (12)$$

where d indexes the anchored dimensions of the latent space, D is the number of anchored dimensions in the latent space, $(\alpha_{d})_{d=1}^{D}$ and $(\beta_{d})_{d=1}^{D}$ are scalar coefficients, $L_{a}^{D}(d)$ denotes the error between: (i) the target multi-modal data archetype for anchored dimension d, and (ii) the predicted multi-modal data archetype for anchored dimension d, and $L_{a}^{E}(d)$ denotes the error between: (i) the embedding of the target multi-modal data archetype for anchored dimension d, and (ii) the basis embedding (e.g., unit embedding) representing anchored dimension d in the latent space. In some cases, the $(\alpha_{d})_{d=1}^{D}$ coefficients all have value zero and the $(\beta_{d})_{d=1}^{D}$ coefficients all have non-zero values. In other cases, the $(\beta_{d})_{d=1}^{D}$ coefficients all have value zero and the $(\alpha_{d})_{d=1}^{D}$ coefficients all have non-zero values. In other cases, one or more of the $(\alpha_{d})_{d=1}^{D}$ coefficients have non-zero values and one or more of the $(\beta_{d})_{d=1}^{D}$ coefficients have non-zero values.
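The structure of the archetype loss can be illustrated with a short sketch. It makes several simplifying assumptions that are not stated in this specification: the encoder and decoder are passed in as plain deterministic callables (e.g., the encoder returning a posterior mean), unit embeddings are used as the basis embeddings, and a squared L₂ error is used for both terms. All names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def unit_embedding(dim: int, latent_size: int) -> List[float]:
    """Embedding with value 1 at position `dim` and 0 elsewhere."""
    return [1.0 if i == dim else 0.0 for i in range(latent_size)]


def squared_l2(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def archetype_loss(
    decoder: Callable[[Vector], Vector],
    encoder: Callable[[Vector], Vector],
    targets: Dict[int, Vector],
    latent_size: int,
    alpha: Dict[int, float],
    beta: Dict[int, float],
) -> float:
    """Sum over anchored dimensions d of
    alpha[d] * error(target, decoder(unit_d))
    + beta[d] * error(encoder(target), unit_d).
    """
    total = 0.0
    for d, target in targets.items():
        e_d = unit_embedding(d, latent_size)
        decode_error = squared_l2(target, decoder(e_d))
        embed_error = squared_l2(encoder(target), e_d)
        total += alpha[d] * decode_error + beta[d] * embed_error
    return total


# Toy linear encoder/decoder pair standing in for the neural networks.
def toy_decoder(z: Vector) -> List[float]:
    return [2.0 * z[0], 3.0 * z[1]]


def toy_encoder(x: Vector) -> List[float]:
    return [x[0] / 2.0, x[1] / 3.0]
```

Setting every `beta[d]` to zero recovers the decoder-only form of the loss, and setting every `alpha[d]` to zero recovers the encoder-only form.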

The training engine 910 can measure the error between a target multi-modal data archetype and a predicted multi-modal data archetype in any appropriate way. A few example techniques for measuring the error between a target multi-modal data archetype and a predicted multi-modal data archetype are described next.

In some implementations, the training engine 910 can measure the error between a target multi-modal data archetype and a predicted multi-modal data archetype using an error measure that is analogous to the reconstruction loss described above, e.g., with reference to equation (11). That is, the predicted multi-modal data archetype can be understood as a “reconstruction” of the target multi-modal data archetype, and the training engine 910 can measure the error between the target multi-modal data archetype and the predicted multi-modal data archetype using the reconstruction loss described above. More specifically, the error measure can include multiple scaling factors that each scale a respective term in the error measure that measures an error between the target multi-modal data archetype and the predicted multi-modal data archetype along a proper subset of the multi-modal feature dimensions. The value of each scaling factor in the error measure can be set based on a relevance of the corresponding proper subset of the feature dimensions to a particular medical condition, as described above.

For example, the training engine 910 can measure the error E(T, P) between a target multi-modal data archetype T and a predicted multi-modal data archetype P as:

$\begin{matrix}{E\left( {T,P} \right) = {\sum\limits_{i = 1}^{n}{\beta_{i} \cdot E\left( {T_{A_{i}},P_{A_{i}}} \right)}}} & \text{(13)}\end{matrix}$

where for each i ∈ {1, ..., n}: $A_i$ designates a respective proper subset of the feature dimensions of the multi-modal data, $\beta_i$ is a scaling factor corresponding to the proper subset $A_i$ of the multi-modal feature dimensions, and $E(T_{A_i}, P_{A_i})$ measures an error between: (i) the proper subset $A_i$ of the feature dimensions in the target multi-modal data archetype T, and (ii) the proper subset $A_i$ of the feature dimensions in the predicted multi-modal data archetype P. The error $E(T_{A_i}, P_{A_i})$ can be measured, e.g., using an L₁ similarity measure, an L₂ similarity measure, a cosine similarity measure, or any other appropriate measure. Generally, the scaling factors $(\beta_i)_{i=1}^{n}$ each have different values.
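As a minimal illustrative sketch, and not as a limiting implementation, the scaled error of equation (13) can be computed as shown below. The function names, the choice of an L₂ error for each per-subset term, and the example values are assumptions made only for illustration.

```python
import math

def l2_error(t, p):
    """Euclidean (L2) distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, p)))

def scaled_archetype_error(target, predicted, subsets, betas, error_fn=l2_error):
    """Equation (13): sum over i of beta_i * E(T_{A_i}, P_{A_i}).

    subsets[i] lists the feature indices in A_i; betas[i] is its scaling factor.
    """
    if len(subsets) != len(betas):
        raise ValueError("need exactly one scaling factor per feature subset")
    total = 0.0
    for indices, beta in zip(subsets, betas):
        t_sub = [target[j] for j in indices]
        p_sub = [predicted[j] for j in indices]
        total += beta * error_fn(t_sub, p_sub)
    return total

# Hypothetical example: four features split into two subsets, with the
# second subset weighted more heavily than the first.
target = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 0.0, 0.0]
error = scaled_archetype_error(
    target, predicted, subsets=[[0, 1], [2, 3]], betas=[0.5, 2.0]
)
```

In this sketch, a larger scaling factor makes deviations along the corresponding subset of feature dimensions contribute more to the overall error.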

In some implementations, the training system 900 can measure the error between a target multi-modal data archetype and a predicted multi-modal data archetype using an error measure without scaling factors. For example, the training system 900 can measure the error between a target multi-modal data archetype and a predicted multi-modal data archetype using an L₁ similarity measure, L₂ similarity measure, cosine similarity measure, or any other appropriate similarity measure.

The training engine 910 can measure the error between: (i) an embedding of a target multi-modal data archetype for an anchored dimension, and (ii) a basis embedding representing the anchored dimension, in any appropriate way, e.g., as an L₂ error.
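As a minimal sketch under stated assumptions, the L₂ error between the embedding of a target archetype and a unit basis embedding can be computed as follows. The helper names are hypothetical, and the embedding is assumed to have already been produced by the encoder neural network.

```python
import math

def unit_basis_embedding(dimension, latent_size):
    """Unit vector with 1.0 at the anchored dimension and 0.0 elsewhere."""
    return [1.0 if i == dimension else 0.0 for i in range(latent_size)]

def anchor_embedding_error(archetype_embedding, anchored_dimension):
    """L2 error between an archetype's embedding and the basis embedding."""
    basis = unit_basis_embedding(anchored_dimension, len(archetype_embedding))
    return math.sqrt(
        sum((e - b) ** 2 for e, b in zip(archetype_embedding, basis))
    )

# Hypothetical embedding of a target archetype in a 3-dimensional latent space.
err = anchor_embedding_error([0.0, 0.6, 0.8], anchored_dimension=1)
```

The error is zero only when the archetype embedding coincides with the basis embedding of the anchored dimension.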

The anchored dimensions of the latent space, and the target archetypes for the anchored dimensions, can be determined in a variety of possible ways. A few example techniques for determining the anchored dimensions of the latent space and the target archetypes for the anchored dimensions are described next.

In some implementations, the training system 900 can initially train the encoder neural network 104 and the decoder neural network 108 for a predefined number of training iterations using an objective function 912 that does not include the archetype loss 1100. The training system 900 can then generate a respective multi-modal data archetype corresponding to each dimension of the latent space, i.e., by processing a respective embedding representing each dimension of the latent space using the decoder neural network 108. The training system 900 can optionally generate an archetype representation for each generated multi-modal data archetype, as described with reference to FIGS. 3.

The training system 900 can provide the multi-modal data archetypes (and, optionally, the archetype representations) to a user, e.g., through a user interface on a user device. The user can provide an input, e.g., through the user interface on the user device, indicating that one or more of the multi-modal data archetypes should be designated as target multi-modal data archetypes (and, by extension, that the corresponding dimensions of the latent space should be anchored dimensions). The training system 900 can then resume training of the encoder neural network 104 and the decoder neural network 108 using an objective function 912 that includes an archetype loss 1100 corresponding to the target multi-modal data archetypes specified by the user.

Thus the first stage of training, i.e., without the archetype loss 1100, can be used to generate a set of “candidate” multi-modal data archetypes from which a user can select (e.g., by way of a user interface on a user device) target multi-modal data archetypes for use in the second stage of training, i.e., with the archetype loss 1100.

In some implementations, a user can directly specify (e.g., through a user interface on a user device) one or more target multi-modal data archetypes, e.g., from a predefined set of multi-modal data archetypes.

Generally, a user can select target multi-modal data archetypes using any appropriate criteria, and selecting target multi-modal data archetypes enables a user to control how the encoder neural network represents multi-modal data in the latent space. This provides a significant advantage over training paradigms that treat the encoder and decoder neural networks as uninterpretable “black boxes” that a user can control only indirectly, e.g., through the choice of training data.

Moreover, users can select target multi-modal data archetypes that represent clinically meaningful patterns in multi-modal data characterizing patients. In particular, users can select target multi-modal data archetypes that are relevant to a particular medical condition, e.g., that include multi-modal features that typically co-occur in patients having the medical condition. Thus the archetype loss can encourage the multi-modal data embeddings generated by the encoder neural network to efficiently represent information relevant to the medical condition. The archetype loss can thus enable the patient clustering system (described with reference to FIG. 5) to determine patient categories that are more relevant to the medical condition. The archetype loss can thereby increase the utility and accuracy of predictions and clinical decisions based on classifying patients into patient categories (as described with reference to FIG. 8).

FIG. 11 illustrates an example of an archetype loss 1100. In this example, the latent space 1112 includes one “anchored” latent dimension 1110, i.e., that is associated with a target multi-modal data archetype 1102 in the multi-modal data space 1116, and one “flexible” (i.e., unanchored) latent dimension 1114, i.e., that is not associated with a target multi-modal data archetype. The decoder neural network 108 maps the anchored latent dimension 1110 to the predicted multi-modal data archetype 1106, and the decoder neural network 108 maps the flexible latent dimension 1114 to the predicted multi-modal data archetype 1108.

During training, the parameter values of the decoder neural network 108 are iteratively adjusted, which causes the predicted archetypes corresponding to the latent dimensions of the latent space to iteratively change over the course of training. The archetype loss 1100 “anchors” 1104 the predicted archetype 1106 to the target archetype 1102, i.e., by penalizing deviations of the predicted archetype 1106 for the anchored latent dimension 1110 from the corresponding target archetype 1102. In contrast, the archetype loss 1100 does not anchor the predicted archetype 1108 corresponding to the flexible latent dimension 1114, instead allowing its position in the multi-modal data space 1116 to vary flexibly over the course of training.

The clustering loss 914 at a training iteration is computed based on a clustering, in the latent space, of the embeddings 906 of the multi-modal data 904 from the training examples in the current batch of training examples. The clustering loss 914 can encourage the embeddings 906 to separate into clusters in the latent space, and can reduce any dependence of the clusters on “confounding” features, e.g., features that are designated as being substantially irrelevant, e.g., to a medical condition or to a treatment for a medical condition. The clustering loss 914 can be generated by a cluster hardening system, as will be described in more detail with reference to FIG. 12.

The prior loss 916 measures, for each training example, an error between: (i) the posterior probability distribution over the latent space generated by the encoder neural network 104 for the training example, and (ii) a predefined “prior” probability distribution over the latent space. The prior probability distribution can be, e.g., a standard Normal probability distribution, i.e., with a mean vector of zeros and a covariance matrix given by the identity matrix. The training engine 910 can measure the error between the posterior probability distribution and the prior probability distribution, e.g., using a Kullback-Leibler divergence measure.
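As one illustrative sketch, the Kullback-Leibler divergence from a Normal posterior to a standard Normal prior has a closed form when the posterior covariance is diagonal. The diagonal-covariance assumption, the averaging over the batch, and the function names below are assumptions made for illustration and are not stated above.

```python
import math

def kl_to_standard_normal(mean, variance):
    """KL( N(mean, diag(variance)) || N(0, I) ) in closed form."""
    if any(v <= 0.0 for v in variance):
        raise ValueError("variances must be positive")
    return 0.5 * sum(
        v + m * m - 1.0 - math.log(v) for m, v in zip(mean, variance)
    )

def prior_loss(posteriors):
    """Average KL term over the (mean, variance) pairs of a batch."""
    return sum(kl_to_standard_normal(m, v) for m, v in posteriors) / len(posteriors)

# Hypothetical batch of two posteriors over a 2-dimensional latent space.
batch = [([0.0, 0.0], [1.0, 1.0]), ([1.0, -1.0], [1.0, 1.0])]
loss = prior_loss(batch)
```

A posterior that already equals the prior contributes zero, so the term penalizes only departures of the posterior from the standard Normal distribution.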

FIG. 12 shows an example cluster hardening system 1200. The cluster hardening system 1200 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The cluster hardening system 1200 operates in conjunction with the training system 900, described in more detail with reference to FIG. 9, that jointly trains an encoder neural network and a decoder neural network. The encoder neural network is configured to process multi-modal data characterizing a subject to generate an embedding of the multi-modal data in a latent space. The decoder neural network is configured to process the embedding of the multi-modal data in the latent space to generate a reconstruction of the original multi-modal data.

The training system 900 can jointly train the encoder neural network and the decoder neural network over multiple training iterations to optimize an objective function. The objective function can include a variety of terms, e.g., a reconstruction loss, an archetype loss, and a prior loss, which are each described in more detail with reference to FIG. 9. In particular, at one or more training iterations, the objective function can include a “clustering loss” 914, which will be described in more detail next.

The cluster hardening system 1200 can generate the clustering loss 914 at one or more training iterations during the joint training of the encoder neural network and the decoder neural network by the training system 900. The training system 900 receives the clustering loss 914 from the cluster hardening system 1200, and includes the clustering loss 914 as a term in the objective function, e.g., as described above with reference to equation (10).

At each training iteration, the training system 900 samples a current batch (set) of training examples (e.g., from a pool of training examples), where each training example corresponds to a respective subject and includes multi-modal data that characterizes the subject. The training system 900 then processes the multi-modal data from each training example using the encoder neural network to generate a respective embedding 906 of the multi-modal data from each training example in the latent space. That is, the training system 900 generates a set of embeddings 906, including a respective embedding 906 corresponding to each training example in the current batch of training examples.

The cluster hardening system 1200 is configured to receive a set of embeddings 906 generated by the training system 900 at a training iteration, and to process the set of embeddings 906 to generate a clustering loss 914 for the training iteration. The cluster hardening system 1200 includes a clustering engine 1202, a training engine 1206, one or more classification machine learning models 1208, and an evaluation engine 1210, which are each described in more detail next.

The clustering engine 1202 receives the set of embeddings 906, and applies a clustering operation to the set of embeddings 906 to generate a partition of the set of embeddings 906 into multiple groups, referred to as “clusters,” that each include multiple embeddings 906. Generally, the clustering operation partitions the set of embeddings into clusters such that embeddings in the same cluster tend to be more similar (according to a similarity measure in the latent space) than embeddings in different clusters. FIG. 13, which will be described in more detail below, provides a visual illustration of applying a clustering operation to a set of embeddings.

As a result of the clustering, each embedding 906 is designated as belonging to a respective cluster in a set of clusters. Thus each embedding 906 can be understood as being associated with a “cluster label” 1204 that identifies the respective cluster that includes the embedding.

The clustering engine 1202 can cluster the set of embeddings 906 using any appropriate clustering operation, e.g., a k-means clustering operation, an expectation maximization clustering operation, a hierarchical agglomerative clustering operation, or a spectral clustering operation. The number of clusters generated by the clustering engine 1202 can be, e.g., a predefined hyper-parameter, or determined dynamically by the clustering engine 1202 during clustering.
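As a minimal sketch of one of the listed options, a k-means clustering operation that assigns a cluster label to each embedding can be written as follows. The function names, the fixed number of iterations, the seeded random initialization, and the example embeddings are assumptions chosen for illustration.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(embeddings, num_clusters, num_iterations=20, seed=0):
    """Return one cluster label per embedding using Lloyd's algorithm."""
    rng = random.Random(seed)
    centroids = [list(e) for e in rng.sample(embeddings, num_clusters)]
    labels = [0] * len(embeddings)
    for _ in range(num_iterations):
        # Assignment step: attach each embedding to its nearest centroid.
        labels = [
            min(range(num_clusters), key=lambda k: squared_distance(e, centroids[k]))
            for e in embeddings
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(num_clusters):
            members = [e for e, label in zip(embeddings, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical 2-dimensional embeddings forming two well-separated groups.
embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
cluster_labels = kmeans(embeddings, num_clusters=2)
```

Here the number of clusters is passed in as a predefined hyper-parameter; determining it dynamically would require a different clustering operation or an additional selection step.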

The training engine 1206 trains one or more classification machine learning models 1208 using the cluster labels 1204 associated with the embeddings 906. A few example techniques by which the training engine 1206 can train a classification machine learning model 1208 are described next.

In some implementations, the training engine 1206 trains a classification machine learning model 1208 to process an embedding 906 to generate a classification output that predicts the cluster label 1204 of the embedding 906. For example, the classification output can include a respective score for each cluster label that defines a likelihood that the embedding is associated with the cluster label. For convenience, a classification machine learning model 1208 that is trained to process an embedding 906 to predict the cluster label 1204 of the embedding 906 may be referred to herein as a “cluster classification model.”

In some implementations, the training engine 1206 trains a classification machine learning model 1208 to process a set of “confounding” features associated with a subject to generate a classification output that predicts the cluster label 1204 of the embedding 906 of the multi-modal data characterizing the subject. A confounding feature can refer to a feature that has been designated as being substantially irrelevant, e.g., to a medical condition, or to a treatment for a medical condition. That is, confounding features measure variation between subjects that, for the purpose of generating predictions related to a medical condition or a treatment for a medical condition, represents “noise” rather than biological characteristics relevant to the predictions. For convenience, a classification machine learning model 1208 that is trained to process a set of confounding features associated with a subject to predict the cluster label 1204 of the embedding 906 of the multi-modal data characterizing the subject may be referred to herein as an “adversarial classification model.”

A few examples of possible confounding features are described next.

In one example, multi-modal data characterizing a subject includes sensor data captured by a sensor, and the set of confounding features associated with the subject can include one or more features of the sensor. For example, the sensor can be a medical imaging sensor, e.g., a magnetic resonance imaging (MRI) machine or a computed tomography (CT) machine, and the confounding features can characterize the manufacturer of the medical imaging sensor, hardware included in the medical imaging sensor, software included in the medical imaging sensor, or calibration parameters of the medical imaging sensor.

In another example, the set of confounding features associated with a subject can include one or more features characterizing an acquisition protocol used to acquire parts of the multi-modal data characterizing the subject. For example, the multi-modal data can include genomic data, and the confounding features can characterize the acquisition protocol used to generate the genomic data, e.g., a number of chimeric reads, an average length of a read, a number of reads mapped to multiple loci, a fraction of reads aligning to the mitochondrial genome, area under coverage for all alignments, or a combination thereof.

In another example, the set of confounding features associated with a subject can include one or more of: an education level of the subject, a home address of the subject, identities of physicians that previously interacted with the subject, an employment history of the subject, a familial history of the subject (e.g., characterizing medical history, or relationship history, or both), a medical history of the subject, or a combination thereof.

It can be appreciated that the examples of possible confounding features provided above are not exhaustive, and that the selection of appropriate confounding features may depend on the context of a particular application, e.g., on a particular medical condition or treatment of interest. The designation of certain features as confounding features may be specified, e.g., by a user of the training system 900. Generally, the set of confounding features associated with a subject is not included in the set of multi-modal data characterizing the subject, i.e., that is processed by the encoder neural network to generate the embedding 906 corresponding to the subject.

In some implementations, the training engine 1206 trains both: (i) a cluster classification model (i.e., that processes embeddings 906), and (ii) an adversarial classification model (i.e., that processes confounding features).

As part of training a classification machine learning model 1208, the training engine 1206 can partition the set of embeddings 906 into: (i) a set of “training” embeddings, and (ii) a set of “validation” embeddings. The training engine 1206 trains the classification machine learning model 1208 to predict the cluster labels associated with the training embeddings, but refrains from training the classification machine learning model 1208 to predict the cluster labels associated with the validation embeddings. That is, the validation embeddings and their associated cluster labels are “held out” from the training of the classification machine learning model 1208.

The training engine 1206 can partition the set of embeddings 906 into a set of training embeddings and a set of validation embeddings in any appropriate way. For example, the training engine 1206 can randomly select a predefined fraction (e.g., 90%, or any other appropriate fraction) of the embeddings in the set of embeddings as training embeddings, and designate the remaining embeddings as being validation embeddings.
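As a minimal sketch, the random partition into training embeddings and validation embeddings can be expressed over embedding indices as follows. The function name, the seeded shuffle, and the rounding rule are assumptions made for illustration.

```python
import random

def split_embeddings(num_embeddings, train_fraction=0.9, seed=0):
    """Return (training_indices, validation_indices) for a random split."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    indices = list(range(num_embeddings))
    random.Random(seed).shuffle(indices)
    num_train = int(round(train_fraction * num_embeddings))
    # Everything not selected for training is held out for validation.
    return sorted(indices[:num_train]), sorted(indices[num_train:])

# Hypothetical batch of 20 embeddings split 90% / 10%.
train_idx, val_idx = split_embeddings(20, train_fraction=0.9)
```

Working with indices rather than the embeddings themselves keeps each held-out embedding aligned with its cluster label and, for an adversarial classification model, with its confounding features.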

Each classification machine learning model 1208 can be any appropriate type of machine learning model, e.g., a neural network model, a random forest model, or a support vector machine model, and can have any appropriate machine learning model architecture. For example, in implementations where a classification machine learning model is a neural network model, the neural network model can include any appropriate types of neural network layers (e.g., fully connected layers, convolutional layers, or attention layers) in any appropriate numbers (e.g., 2 layers, 5 layers, or 10 layers) and connected in any appropriate configuration (e.g., as a linear sequence of layers).

The training engine 1206 can train each classification machine learning model 1208, using any appropriate machine learning training technique, to optimize a classification loss. For example, if a classification machine learning model 1208 is implemented as a neural network model, then the training engine 1206 can train the classification machine learning model 1208 using a stochastic gradient descent training technique. The classification loss measures an accuracy of the classification outputs generated by a classification machine learning model, e.g., the classification loss can be a cross-entropy loss.
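As a simplified sketch, a cluster classification model can be illustrated by a single-layer softmax classifier trained by stochastic gradient descent on a cross-entropy loss. The single-layer model, the learning rate, the number of epochs, and all names and example values below are assumptions chosen for brevity; any appropriate model and training technique can be used instead.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, biases, x):
    """Return one score per cluster label for embedding x."""
    logits = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return softmax(logits)

def predict_label(weights, biases, x):
    """Return the cluster label with the highest score."""
    scores = predict(weights, biases, x)
    return scores.index(max(scores))

def train_cluster_classifier(embeddings, labels, num_clusters, epochs=200, lr=0.1, seed=0):
    """Fit the classifier with per-example SGD on the cross-entropy loss."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    weights = [[0.0] * dim for _ in range(num_clusters)]
    biases = [0.0] * num_clusters
    order = list(range(len(embeddings)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            probs = predict(weights, biases, embeddings[i])
            for k in range(num_clusters):
                # Gradient of cross-entropy w.r.t. logit k: p_k - 1[k == label].
                grad = probs[k] - (1.0 if k == labels[i] else 0.0)
                biases[k] -= lr * grad
                for j in range(dim):
                    weights[k][j] -= lr * grad * embeddings[i][j]
    return weights, biases

# Hypothetical training embeddings with their cluster labels.
train_embeddings = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2], [2.0, 2.1], [2.2, 1.9], [1.9, 2.0]]
train_labels = [0, 0, 0, 1, 1, 1]
weights, biases = train_cluster_classifier(train_embeddings, train_labels, num_clusters=2)
```

An adversarial classification model could be sketched in the same way by passing confounding feature vectors in place of the embeddings.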

After the training engine 1206 trains each classification machine learning model 1208, the evaluation engine 1210 measures the respective classification accuracy of each classification machine learning model 1208, and determines the clustering loss 914 based on the respective classification accuracies of the classification machine learning models 1208.

More specifically, the evaluation engine 1210 can measure the classification accuracy of a classification machine learning model 1208 with reference to the cluster labels associated with the embeddings 906 designated as validation embeddings (as described above). The evaluation engine 1210 generally excludes the cluster labels associated with the embeddings designated as training embeddings from the evaluation of the classification accuracy of the classification machine learning model 1208.

To measure the classification accuracy of a classification machine learning model 1208 on the validation embeddings, the evaluation engine 1210 uses the classification machine learning model to generate a respective classification output for each validation embedding that defines a prediction for the cluster label of the validation embedding. For example, if the classification machine learning model 1208 is a cluster classification model, then the evaluation engine 1210 processes each respective validation embedding to generate a respective classification output for the validation embedding. As another example, if the classification machine learning model 1208 is an adversarial classification model, then the evaluation engine 1210 processes a respective set of confounding features corresponding to each validation embedding to generate a respective classification output for the validation embedding.

After generating a classification output for each validation embedding, the evaluation engine 1210 can measure a respective classification error (e.g., cross-entropy error) between: (i) the classification output, and (ii) the cluster label, for each validation embedding. The evaluation engine 1210 can then determine the overall classification error for the classification machine learning model 1208 based on the respective classification error of the classification machine learning model 1208 for each validation embedding. For example, the evaluation engine 1210 can determine the overall classification error for the classification machine learning model as a measure of central tendency (e.g., a mean, median, or mode) of the classification errors of the classification machine learning model 1208 for the validation embeddings.
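As a minimal sketch, the overall classification error can be computed as the mean of per-embedding cross-entropy errors over the validation embeddings. The function names, the use of the mean as the measure of central tendency, and the example scores are assumptions made for illustration; the classification outputs are assumed to be probabilities that sum to one.

```python
import math
import statistics

def cross_entropy(scores, cluster_label, eps=1e-12):
    """Cross-entropy between predicted cluster scores and the true cluster label."""
    # eps guards against log(0) when the true label receives a zero score.
    return -math.log(max(scores[cluster_label], eps))

def overall_classification_error(classification_outputs, cluster_labels):
    """Mean of the per-embedding cross-entropy errors over the validation set."""
    errors = [
        cross_entropy(scores, label)
        for scores, label in zip(classification_outputs, cluster_labels)
    ]
    return statistics.mean(errors)

# Hypothetical classification outputs for two validation embeddings.
outputs = [[0.5, 0.5], [0.25, 0.75]]
validation_labels = [0, 1]
overall = overall_classification_error(outputs, validation_labels)
```

Replacing `statistics.mean` with `statistics.median` would give a different measure of central tendency without changing the rest of the sketch.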

The evaluation engine 1210 can determine the clustering loss 914 based on (e.g., as a function of) the respective overall classification error of each classification machine learning model 1208. For example, the evaluation engine 1210 can define the clustering loss 914 as:

$\begin{matrix}{L_{c} = \gamma_{1} \cdot L_{\text{cluster}} - \gamma_{2} \cdot L_{\text{adversarial}}} & \text{(14)}\end{matrix}$

where $L_{\text{cluster}}$ denotes the overall classification error of the cluster classification model, $L_{\text{adversarial}}$ denotes the overall classification error of the adversarial classification model, and $(\gamma_i)_{i=1}^{2}$ are positive scalar weighting factors. In some cases, the clustering loss 914 includes only the overall classification error of the cluster classification model and excludes the overall classification error of the adversarial classification model (i.e., such that γ₁ > 0 and γ₂ = 0). In some cases, the clustering loss 914 includes only the overall classification error of the adversarial classification model and excludes the overall classification error of the cluster classification model (i.e., such that γ₁ = 0 and γ₂ > 0).
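As a minimal sketch, equation (14) can be evaluated as shown below. The function name, the default weights, and the example error values are assumptions; a weight of zero is permitted here to cover the cases in which one of the two terms is excluded.

```python
def clustering_loss(cluster_error, adversarial_error, gamma_1=1.0, gamma_2=1.0):
    """Equation (14): L_c = gamma_1 * L_cluster - gamma_2 * L_adversarial."""
    if gamma_1 < 0.0 or gamma_2 < 0.0:
        raise ValueError("weighting factors must be non-negative")
    return gamma_1 * cluster_error - gamma_2 * adversarial_error

# Hypothetical overall classification errors for the two models.
both_terms = clustering_loss(0.25, 0.5, gamma_1=1.0, gamma_2=0.5)
cluster_only = clustering_loss(0.25, 0.5, gamma_1=1.0, gamma_2=0.0)
adversarial_only = clustering_loss(0.25, 0.5, gamma_1=0.0, gamma_2=1.0)
```

Because the adversarial term enters with a negative sign, a lower clustering loss corresponds to a lower error for the cluster classification model and a higher error for the adversarial classification model.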

In some implementations, the clustering loss 914 is based at least in part on the overall classification error of the cluster classification model. In these implementations, optimizing the clustering loss 914, e.g., as part of optimizing the objective function used to jointly train the encoder and decoder neural networks, encourages an increase in the overall classification accuracy of the cluster classification model. The overall classification accuracy of the cluster classification model would be enhanced if the embeddings are distributed in clusters (i.e., spatially separated groupings of embeddings) in the latent space, e.g., as opposed to being uniformly distributed through the latent space. Thus, optimizing the clustering loss 914 encourages the embeddings 906 to separate into clusters in the latent space. FIG. 14, which will be described in more detail below, provides a visual illustration of the clustering loss encouraging the embeddings 906 to separate into clusters in the latent space.

By encouraging the embeddings 906 to separate into clusters in the latent space, the clustering loss 914 can increase the likelihood that embeddings in the latent space can be unambiguously assigned to corresponding clusters of embeddings. In particular, the clustering loss 914 can encourage greater similarity between embeddings in the same cluster, and greater difference between embeddings in different clusters.

The patient classification system, described with reference to FIG. 7A, can define each cluster as a patient category, where a subject is included in a patient category if an embedding of multi-modal data characterizing the subject is included in a cluster representing the patient category. In particular, the patient classification system can generate predictions for a subject (e.g., for whether the subject has a medical condition, or will respond to a treatment for a medical condition) based on the patient category of the subject. Thus by encouraging embeddings to separate into clusters in the latent space, the clustering loss 914 can enable the patient classification system to unambiguously assign a patient to a corresponding patient category and generate predictions characterizing the subject (i.e., based on the patient category of the subject) with higher accuracy.

In some implementations, the clustering loss 914 is based at least in part on the overall classification error of the adversarial classification model. In these implementations, optimizing the clustering loss 914, e.g., as part of optimizing the objective function used to jointly train the encoder and decoder neural networks, encourages a decrease in the overall classification accuracy of the adversarial classification model. For example, with reference to equation (14), the overall classification error of the adversarial classification model ($L_{\text{adversarial}}$) may be scaled by a negative factor in the clustering loss (-γ₂), such that minimizing the clustering loss encourages maximizing the overall classification error of the adversarial classification model.

Thus, optimizing the clustering loss 914 can encourage confounding features corresponding to embeddings with different cluster labels to become more “entangled” in the confounding feature space. That is, optimizing the clustering loss 914 can encourage an increase in the similarity, measured in the confounding feature space, between confounding features corresponding to embeddings with different cluster labels relative to confounding features corresponding to embeddings with the same cluster label. FIG. 15, which will be described in more detail below, provides a visual illustration of the clustering loss encouraging confounding features corresponding to embeddings 906 with different cluster labels to become more entangled in the confounding feature space.

By encouraging confounding features corresponding to embeddings 906 with different cluster labels 1204 to become entangled in the confounding feature space, the clustering loss 914 can increase the relevance of the clusters, e.g., to a medical condition or to a treatment for a medical condition. In particular, confounding features are, as described above, features that have been designated as being substantially irrelevant, e.g., to a medical condition or to a treatment for a medical condition. Therefore, causing confounding features corresponding to embeddings 906 with different cluster labels 1204 to become entangled in the confounding feature space can reduce any dependence of the clusters on the confounding features. In particular, increased entanglement in the confounding feature space reduces the likelihood that clusters in the latent space can be delineated or distinguished on the basis of confounding features.

In some implementations, a cluster classification model that the cluster hardening system 1200 trains to optimize the clustering loss 914 can be provided for use in classifying new patients into patient categories. For instance, the patient classification system 700, described with reference to FIG. 7A, can use a cluster classification model trained to optimize the clustering loss 914 to implement the scoring engine 706 for classifying new patients into patient categories.

FIG. 13 provides a visual illustration of applying a clustering operation to a set of embeddings. Each embedding is represented by a circle. Embeddings included in “cluster #1” are shown as light colored circles, and embeddings included in “cluster #2” are shown as dark colored circles. For convenience, the embeddings are shown with reference to two dimensions of the latent space (i.e., “latent dimension A” and “latent dimension B”), but in some cases the latent space includes more than two dimensions. It can be appreciated that embeddings in the same cluster tend to be more similar, in the latent space, than embeddings in different clusters.

FIG. 14 provides a visual illustration of the clustering loss encouraging embeddings to separate into clusters in the latent space. Diagram 1402 shows the distribution of the embeddings in the latent space at a first training iteration, and diagram 1404 shows the distribution of the embeddings in the latent space at a subsequent training iteration. The clustering loss has been included in the objective function being optimized during the joint training of the encoder and decoder neural networks at one or more intervening training iterations between the first training iteration and the subsequent training iteration. Each embedding is represented as a circle. For convenience, the embeddings are shown with reference to two dimensions of the latent space (i.e., “latent dimension A” and “latent dimension B”), but in some cases the latent space includes more than two dimensions. It can be appreciated that the clustering loss encourages the embeddings to separate into clusters in the latent space.

FIG. 15 provides a visual illustration of the clustering loss encouraging confounding features corresponding to embeddings with different cluster labels to become more entangled in the confounding feature space. Diagram 1502 shows the distribution of confounding features corresponding to embeddings at a first training iteration, and diagram 1504 shows the distribution of confounding features corresponding to embeddings at a subsequent training iteration. The clustering loss has been included in the objective function being optimized during the joint training of the encoder and decoder neural networks at one or more intervening training iterations between the first training iteration and the subsequent training iteration.

Confounding features corresponding to embeddings in a first cluster are shown as light colored circles, and confounding features corresponding to embeddings in a second cluster are shown as dark colored circles. For convenience, the confounding features are shown with reference to two dimensions of the confounding feature space (i.e., dimensions corresponding to “confounding feature dimension A” and “confounding feature dimension B”), but in some cases the confounding feature space includes more than two dimensions.

It can be appreciated that the clustering loss encourages confounding features corresponding to embeddings in different clusters to become more entangled in the confounding feature space.

FIG. 16 shows an example conditioning system 1600. The conditioning system 1600 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The conditioning system 1600 is configured to receive multi-modal data 1614 characterizing a “target” subject 1612. The multi-modal data 1614 includes a respective feature representation for each modality in a set of modalities, where a feature representation for a modality refers to a collection of features that collectively represent data from the modality, as described above.

The conditioning system 1600 conditions the multi-modal data 1614characterizing the target subject 1612 based on conditioning data,derived from a population of “reference” subjects 1602, to generateconditioned multi-modal data 1616. (A “population” of subjects refers toa set of one or more subjects, and can include any appropriate number ofsubjects, e.g., 10, 100, or 1000 subjects). The conditioning system 1600thus enriches the multi-modal data 1614 characterizing the targetsubject 1612 with conditioning data derived from the population ofreference subjects 1602, as will be described in more detail below.

Conditioned multi-modal data 1616 generated by the conditioning system1600 can be used in any of a variety of applications. For example,conditioned multi-modal data 1616 generated by the conditioning system1600 for a target subject 1612 can be provided as an input to a machinelearning model 1618. The machine learning model 1618 can process theconditioned multi-modal data 1616 to generate a machine learning modeloutput characterizing the target subject 1612.

The machine learning model 1618 (i.e., that processes the conditionedmulti-modal data 1616) can have any appropriate machine learning modelarchitecture (e.g., a neural network architecture, a random forestarchitecture, or a support vector machine architecture), and can beconfigured to generate any appropriate machine learning model output.For example, the machine learning model output can be a predictioncharacterizing the target subject 1612, e.g., a prediction for whetherthe target subject 1612 has a particular medical condition, a predictionfor whether the target subject 1612 would respond to a particularmedical treatment, or a prediction for a prognosis of the target subject1612. As another example, the machine learning model output can be anembedding representing the target subject 1612.

In particular, the conditioning system 1600 can be used to pre-process multi-modal data provided to the machine learning system 100 described with reference to FIG. 1. For example, the conditioning system 1600 can pre-process multi-modal data used by the training system 900, described with reference to FIG. 9, for training the encoder neural network and the decoder neural network of the machine learning system 100. As another example, the conditioning system 1600 can pre-process multi-modal data characterizing a subject that is processed by the machine learning system 100 to classify the subject as being included in a patient category, e.g., as described with reference to FIG. 7A.

Conditioning multi-modal data 1614 characterizing a target subject 1612 prior to processing the multi-modal data 1614 using a machine learning model can enable the machine learning model to operate more effectively. For example, the machine learning model can leverage the enhanced information content of the conditioned multi-modal data 1616 to generate predictions with higher accuracy, or to generate richer feature embeddings characterizing the target subject 1612.

The conditioning system 1600 can condition multi-modal data 1614 characterizing a target subject 1612 based on a population of reference subjects 1602 using a representation engine 1606 and a conditioning engine 1610, which are each described in more detail next.

The representation engine 1606 is configured to obtain one or more feature representations 1604, corresponding to a reference modality, for each reference subject 1602. The reference modality can be any appropriate modality, e.g., an fMRI modality, a PET modality, a genomic data modality, or a proteomic data modality. The representation engine 1606 can obtain the feature representations 1604 of the reference subjects, e.g., by retrieving the feature representations 1604 of the reference subjects 1602 from a data store, e.g., a physical data storage device or a logical data storage area.

In some cases, the conditioning system 1600 obtains, for each reference subject 1602, a “pre-treatment” feature representation of the reference subject 1602 and a “post-treatment” feature representation of the reference subject 1602. The pre-treatment feature representation of the reference subject may have been captured before a medical treatment (e.g., a drug) was administered to the reference subject (e.g., one hour or one day before the medical treatment was administered). The post-treatment feature representation of the reference subject may have been captured after the medical treatment was administered to the reference subject (e.g., one hour or one day after the medical treatment was administered).

The representation engine 1606 processes the feature representations 1604 of the reference subjects 1602 to generate conditioning data 1608, i.e., for use in conditioning the multi-modal data 1614 characterizing the target subject 1612. Generally, the conditioning data 1608 can be represented as an ordered collection of numerical values, e.g., a vector, matrix, or other tensor of numerical values. For convenience, the collection of numerical values representing the conditioning data 1608 can be understood as being organized into a set of feature dimensions.

A few example techniques by which the representation engine 1606 can generate the conditioning data 1608 based on the feature representations 1604 of the reference subjects 1602 are described next.

In some implementations, each feature representation 1604 of each reference subject 1602 has a set of feature dimensions, and the representation engine 1606 generates conditioning data 1608 that includes a respective correlation coefficient for each pair of feature dimensions. The correlation coefficient for a pair of feature dimensions (i, j) can represent a correlation, across the population of reference subjects 1602, between the value of feature dimension i and the value of feature dimension j. The correlation coefficient can be any appropriate correlation coefficient, e.g., a Pearson correlation coefficient, or a Spearman correlation coefficient, or a Kendall correlation coefficient. In these implementations, the conditioning data 1608 can be represented, e.g., as an N × N matrix, where N is the number of feature dimensions in the feature representations 1604 of the reference subjects 1602 and entry (i,j) of the matrix represents the correlation coefficient for the pair of feature dimensions (i,j).

A few examples of generating conditioning data as a collection of correlation coefficients are described next.

In one example, the feature representation 1604 of each reference subject 1602 can include proteomic data that defines, for each protein in a predefined set of proteins, an expression level of the protein in the reference subject 1602. In this example, the representation engine 1606 can generate conditioning data that defines a respective correlation coefficient for each pair of proteins from the set of proteins. The correlation coefficient for a pair of proteins measures a correlation in the expression levels of the pair of proteins across the population of reference subjects 1602.

In another example, the feature representation 1604 of each reference subject can include genomic data that defines, for each gene in a predefined set of genes, an expression level of the gene in the genome of the reference subject 1602. In this example, the representation engine 1606 can generate conditioning data 1608 that defines a respective correlation coefficient for each pair of genes in the set of genes. The correlation coefficient for a pair of genes measures a correlation in the expression levels of the pair of genes across the population of reference subjects 1602.
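As a non-limiting illustration, the pairwise correlation technique described above can be sketched as follows. The sketch assumes a Pearson correlation coefficient and assumes that each reference subject is represented as a list of N expression levels. The function names `pearson` and `correlation_conditioning_matrix`, the handling of constant feature dimensions, and the numerical values are hypothetical and are chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant feature dimension is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlation_conditioning_matrix(reference_features):
    """Build an N x N matrix whose entry (i, j) is the correlation, across
    reference subjects, between feature dimension i and feature dimension j."""
    n_dims = len(reference_features[0])
    # Column d holds the value of feature dimension d for every subject.
    columns = [[subject[d] for subject in reference_features]
               for d in range(n_dims)]
    return [[pearson(columns[i], columns[j]) for j in range(n_dims)]
            for i in range(n_dims)]


# Hypothetical expression levels of three genes in four reference subjects.
reference_features = [
    [1.0, 2.0, 5.0],
    [2.0, 4.0, 3.0],
    [3.0, 6.0, 4.0],
    [4.0, 8.0, 1.0],
]
conditioning_matrix = correlation_conditioning_matrix(reference_features)
```

In this hypothetical data the second gene is always expressed at twice the level of the first, so the corresponding off-diagonal entry is close to one. A Spearman or Kendall coefficient could be substituted for `pearson` without changing the shape of the result.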

In some implementations, to generate the conditioning data 1608, the representation engine 1606 computes a “differential” feature representation for each reference subject 1602 as a difference between a post-treatment feature representation and a pre-treatment feature representation of the reference subject 1602. For example, the representation engine 1606 can compute the differential feature representation for each reference subject 1602 by subtracting the pre-treatment feature representation of the reference subject 1602 from the post-treatment feature representation 1604 of the reference subject 1602. The representation engine 1606 can then generate the conditioning data 1608 by combining the differential feature representations of the reference subjects. For example, the representation engine 1606 can compute the value of each feature dimension of the conditioning data as a measure of central tendency of the values of a corresponding feature dimension in the differential feature representations of the reference subjects 1602. The measure of central tendency can be, e.g., a mean, a median, or a mode.
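As a non-limiting illustration, the differential technique described above can be sketched as follows. The sketch assumes that the pre-treatment and post-treatment feature representations are lists of equal length and uses a mean as the default measure of central tendency. The function name `differential_conditioning_vector`, its `combine` parameter, and the numerical values are hypothetical.

```python
from statistics import mean, median


def differential_conditioning_vector(pre, post, combine=mean):
    """Subtract each subject's pre-treatment representation from the
    post-treatment representation, then combine the differences per
    feature dimension across subjects using `combine`."""
    diffs = [[after - before for before, after in zip(p, q)]
             for p, q in zip(pre, post)]
    # zip(*diffs) groups the differences by feature dimension.
    return [combine(dimension) for dimension in zip(*diffs)]


# Hypothetical representations of two reference subjects, two dimensions each.
pre_treatment = [[1.0, 2.0], [3.0, 4.0]]
post_treatment = [[2.0, 2.0], [5.0, 3.0]]

conditioning_mean = differential_conditioning_vector(pre_treatment, post_treatment)
conditioning_median = differential_conditioning_vector(
    pre_treatment, post_treatment, combine=median)
```

Passing `statistics.median` or `statistics.mode` as `combine` selects a different measure of central tendency.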

A few examples of generating conditioning data 1608 using pre-treatment and post-treatment feature representations of the reference subjects 1602 are described next.

In one example, for each reference subject 1602, the pre-treatment and the post-treatment feature representations of the reference subject 1602 are derived from PET imaging of the brain of the reference subject 1602.

In particular, the pre-treatment feature representation of the reference subject can be derived from a PET image captured prior to a drug being administered to the reference subject. For example, the pre-treatment feature representation of the reference subject can define, for each brain region in a parcellation of the brain of the reference subject, an average of the intensity values of the voxels in the brain region in the pre-treatment PET image.

The post-treatment feature representation of the reference subject can be derived from a PET image captured after the drug (which is labeled with a tracer element) is administered to the reference subject. For example, the post-treatment feature representation of the reference subject can define, for each brain region in the parcellation of the brain of the reference subject, an average of the intensity values of the voxels in the brain region in the post-treatment PET image.

In this example, the conditioning data 1608 may represent, for each brain region in the brain parcellation, a measure of penetration by the drug into the brain region across the population of reference subjects. The conditioning data 1608 can be represented, e.g., as an N × 1 vector of numerical values, where N is the number of brain regions in the brain parcellation, and the value of each entry i in the vector represents penetration of the drug into brain region i.

In another example, for each reference subject 1602, the pre-treatment and the post-treatment feature representations of the reference subject 1602 are each derived from fMRI imaging of the brain of the reference subject 1602.

In particular, the pre-treatment feature representation of the reference subject can be derived from fMRI imaging of the reference subject 1602 prior to a drug being administered to the reference subject. For example, the pre-treatment feature representation of the reference subject can be a functional connectivity matrix representing correlations between blood flow curves in brain regions of the reference subject prior to the drug being administered.

The post-treatment feature representation of the reference subject can be derived from fMRI imaging of the reference subject after the drug is administered to the reference subject. For example, the post-treatment feature representation can be a functional connectivity matrix representing correlations between blood flow curves in brain regions of the reference subject after the drug is administered.

In this example, the conditioning data 1608 can represent, for each pair of brain regions in a brain parcellation, a measure of change in functional connectivity between the pair of brain regions (i.e., measured across the population of reference patients) as a result of administration of the drug. The conditioning data 1608 can be represented, e.g., as an N × N matrix of numerical values, where N is the number of brain regions in the brain parcellation and the value of each entry (i,j) in the matrix represents a change in the functional connectivity between the pair of brain regions (i,j).

In some implementations, to generate the conditioning data 1608, the representation engine 1606 obtains both: (i) a feature representation, and (ii) a label, for each reference subject 1602. The label can define, e.g., whether the reference subject 1602 has a medical condition, or whether the reference subject responded successfully to a treatment for a medical condition. The label for a reference patient can generally be represented as a numerical value, e.g., a binary 0/1 numerical value. In these implementations, the representation engine 1606 can generate conditioning data 1608 that includes a respective correlation coefficient for each feature dimension in the feature representations 1604 of the reference subjects 1602. The correlation coefficient for a feature dimension can represent a correlation, across the population of reference subjects 1602, between: (i) the value of the feature dimension in the feature representations 1604 of the reference subjects 1602, and (ii) the labels of the reference subjects 1602. The correlation coefficient can be any appropriate correlation coefficient, e.g., a Pearson correlation coefficient, or a Spearman correlation coefficient, or a Kendall correlation coefficient. In these implementations, the conditioning data 1608 can be represented, e.g., as an N × 1 vector, where N is the number of feature dimensions in the feature representations 1604 of the reference subjects 1602 and entry i of the vector represents the correlation coefficient for feature dimension i.
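As a non-limiting illustration, the label-based technique described above can be sketched as follows. The sketch assumes binary 0/1 labels and a Pearson correlation coefficient. The function names `pearson` and `label_correlation_vector`, the handling of constant inputs, and the numerical values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Assumption: constant inputs are treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def label_correlation_vector(reference_features, labels):
    """Return an N x 1 vector (as a list) whose entry i is the correlation,
    across reference subjects, between feature dimension i and the labels."""
    n_dims = len(reference_features[0])
    return [pearson([subject[d] for subject in reference_features], labels)
            for d in range(n_dims)]


# Hypothetical data: four reference subjects, two feature dimensions,
# and a 0/1 label (e.g., whether the subject responded to a treatment).
reference_features = [[0.1, 5.0], [0.2, 4.0], [0.8, 5.0], [0.9, 4.0]]
labels = [0, 0, 1, 1]
conditioning_vector = label_correlation_vector(reference_features, labels)
```

In this hypothetical data the first feature dimension tracks the label closely and receives a coefficient near one, while the second feature dimension is unrelated to the label and receives a coefficient near zero.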

Optionally, the conditioning system 1600 can normalize the conditioning data 1608, e.g., by applying a soft-max function or a sigmoid function to the conditioning data 1608.
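As a non-limiting illustration, the optional normalization can be sketched as follows for conditioning data represented as a vector. The function names and the numerical values are hypothetical. The soft-max sketch subtracts the maximum value before exponentiating, which is a common numerical-stability measure and is an assumption of this sketch rather than a requirement of the described system.

```python
import math


def sigmoid(values):
    """Apply the sigmoid function element-wise; each output lies in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]


def softmax(values):
    """Apply the soft-max function; the outputs are positive and sum to one."""
    shift = max(values)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical conditioning vector of correlation coefficients.
conditioning_vector = [0.9, -0.2, 0.1]
normalized_sigmoid = sigmoid(conditioning_vector)
normalized_softmax = softmax(conditioning_vector)
```

The sigmoid normalizes each entry independently, whereas the soft-max normalizes the entries relative to one another.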

Generally, the conditioning system 1600 can generate the conditioning data 1608 once, and thereafter maintain the conditioning data 1608 for use in conditioning respective multi-modal data 1614 characterizing any appropriate number of target subjects 1612. That is, the conditioning system 1600 is not required to regenerate the conditioning data 1608 from the feature representations 1604 of the reference subjects 1602 each time the conditioning system 1600 conditions multi-modal data 1614 characterizing a target subject 1612. In particular, after generating the conditioning data 1608, the conditioning system 1600 can store the conditioning data 1608 in a data store, and then retrieve the conditioning data 1608 from the data store each time the conditioning system 1600 conditions multi-modal data 1614 characterizing a target subject 1612.

The conditioning engine 1610 is configured to apply the conditioning data 1608 to the multi-modal data 1614 characterizing the target subject 1612 to generate conditioned multi-modal data 1616. In particular, for each modality in a predefined set of one or more “target” modalities, the conditioning engine 1610 applies the conditioning data 1608 to a set of feature dimensions of the multi-modal data corresponding to the target modality.

The set of target modalities that are conditioned using the conditioning data 1608 can be a proper subset of the full set of modalities included in the multi-modal data 1614. Thus, certain feature dimensions of the multi-modal data 1614 may be unaffected by the application of the conditioning data 1608 to the multi-modal data 1614.

In some cases, the conditioning engine 1610 implements a cross-modal conditioning operation by using the conditioning data 1608 to condition target modalities in the multi-modal data 1614 which are different from the reference modality used to generate the conditioning data 1608. For example, as will be described in more detail below, the conditioning engine 1610 can use conditioning data 1608 derived from PET imaging showing penetration of a drug in the brains of the reference subjects to condition fMRI data characterizing the brain of the target subject 1612.

The conditioning engine 1610 can apply the conditioning data 1608 to a set of feature dimensions of the multi-modal data 1614 in any of a variety of possible ways. A few example techniques for applying the conditioning data to a set of feature dimensions of the multi-modal data 1614 are described next.

In some implementations, the conditioning data 1608 can be represented as an N × 1 vector, and the conditioning engine 1610 can apply the conditioning data 1608 to a set of N corresponding feature dimensions of the multi-modal data 1614. For example, the conditioning engine 1610 can add the conditioning data 1608 to the corresponding feature dimensions of the multi-modal data 1614. As another example, the conditioning engine 1610 can pointwise multiply the conditioning data 1608 by the corresponding feature dimensions of the multi-modal data 1614. A few examples of applying conditioning data 1608 represented as an N × 1 vector to a set of N corresponding feature dimensions of the multi-modal data 1614 are described next.

In one example, the conditioning data 1608 can be an N × 1 vector of numerical values representing penetration of a drug into each of N brain regions in a brain parcellation. In this example, the conditioning data 1608 can be applied to a set of N feature dimensions of the multi-modal data representing features of the N brain regions in the brain of the target subject 1612. For example, the conditioning data 1608 can be applied to a set of N feature dimensions of the multi-modal data representing activations of the N brain regions measured in the brain of the target subject 1612 by fMRI. (The “activation” of a brain region, as measured by fMRI, can refer to, e.g., an average or a maximum value of an average blood flow curve for the brain region.) As another example, the conditioning data 1608 can be applied to a set of N feature dimensions of the multi-modal data representing electrical activity of the N brain regions measured in the brain of the target subject by EEG probes.

In another example, the conditioning data 1608 can be an N × 1 vector of correlation coefficients representing correlations between: (i) values of feature dimensions in the feature representations of the reference subjects 1602, and (ii) labels of the reference subjects. In this example, the N × 1 conditioning vector can be pointwise multiplied by N corresponding feature dimensions of the multi-modal data 1614, e.g., N feature dimensions of the multi-modal data corresponding to the same modality as the feature representations 1604 of the reference subjects 1602.
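As a non-limiting illustration, applying an N × 1 conditioning vector to only the target modalities of a set of multi-modal data can be sketched as follows. The sketch assumes that the multi-modal data is held as a dictionary mapping a modality name to a list of N feature values. The function names, the modality names, and the numerical values are hypothetical.

```python
def condition_vector(features, conditioning, mode="multiply"):
    """Apply an N x 1 conditioning vector to N feature dimensions,
    either by addition or by pointwise multiplication."""
    if len(features) != len(conditioning):
        raise ValueError("features and conditioning data must have equal length")
    if mode == "add":
        return [f + c for f, c in zip(features, conditioning)]
    if mode == "multiply":
        return [f * c for f, c in zip(features, conditioning)]
    raise ValueError(f"unknown mode: {mode}")


def condition_multimodal(multimodal, conditioning, target_modalities,
                         mode="multiply"):
    """Condition only the target modalities; other modalities are copied
    through unchanged."""
    return {
        modality: (condition_vector(values, conditioning, mode)
                   if modality in target_modalities else list(values))
        for modality, values in multimodal.items()
    }


# Hypothetical target-subject data over three brain regions.
multimodal = {
    "fmri_activation": [0.5, 1.0, 2.0],
    "eeg_activity": [3.0, 4.0, 5.0],
}
# Hypothetical drug penetration per brain region (e.g., derived from PET).
penetration = [0.2, 0.0, 1.0]

conditioned = condition_multimodal(
    multimodal, penetration, target_modalities={"fmri_activation"})
```

Here the conditioning data derived from one modality is applied to fMRI features only, so the EEG feature dimensions are unaffected, which corresponds to the case where the target modalities are a proper subset of the modalities in the multi-modal data.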

In some implementations, the conditioning data 1608 can be represented as an N × N matrix, and the conditioning engine 1610 can apply the conditioning data 1608 to a set of N or N² corresponding feature dimensions of the multi-modal data 1614. (The “corresponding feature dimensions of the multi-modal data” refer to, e.g., a set of feature dimensions of the multi-modal data that are designated, by the conditioning system, to be conditioned using the conditioning data.) For example, the conditioning engine 1610 can add the elements of the N × N conditioning matrix to, or pointwise multiply them by, N² corresponding feature dimensions of the multi-modal data 1614. As another example, the conditioning engine 1610 can matrix multiply the N × N conditioning matrix by N corresponding feature dimensions of the multi-modal data 1614 (e.g., represented as an N × 1 vector) or by N² corresponding feature dimensions of the multi-modal data 1614 (e.g., represented as an N × N matrix).

More specifically, for example, the conditioning data 1608 can be an N × N matrix, derived from fMRI data characterizing the reference patients, that measures changes in functional connectivity between pairs of brain regions across the population of reference patients as a result of administration of a drug, as described above. In this example, N can represent the number of brain regions in a brain parcellation. The N × N conditioning matrix can be added to, pointwise multiplied by, or matrix multiplied by N² corresponding feature dimensions of the multi-modal data 1614, e.g., that represent a functional connectivity matrix derived from fMRI imaging of the target subject 1612. The N × N conditioning matrix can also be matrix multiplied by N corresponding feature dimensions of the multi-modal data 1614, e.g., that represent activations of the N brain regions measured in the brain of the target subject 1612 by fMRI.

As another example, the conditioning data 1608 can be an N × N matrix of correlation coefficients representing correlations, across the population of reference subjects 1602, between expression levels of proteins in a predefined set of N proteins. In this example, the N × N conditioning matrix can be matrix multiplied by N corresponding feature dimensions of the multi-modal data 1614, e.g., that represent expression levels of each of the N proteins in the target subject 1612.

As another example, the conditioning data 1608 can be an N × N matrix of correlation coefficients representing correlations, measured across the population of reference subjects 1602, between expression levels of genes in a predefined set of N genes. In this example, the N × N conditioning matrix can be matrix multiplied by N corresponding feature dimensions of the multi-modal data 1614, e.g., that represent expression levels of each of the N genes in the genome of the target subject 1612.
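As a non-limiting illustration, the N × N conditioning operations described above can be sketched as follows, with matrices held as lists of rows. The function names `matvec`, `matmul`, and `pointwise`, and the numerical values, are hypothetical.

```python
import operator


def matvec(matrix, vector):
    """Matrix multiply an N x N matrix by an N x 1 vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def matmul(a, b):
    """Matrix multiply two N x N matrices."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns]
            for row in a]


def pointwise(a, b, op):
    """Combine two N x N matrices element by element using `op`."""
    return [[op(x, y) for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Hypothetical 2 x 2 protein correlation matrix applied to N = 2
# expression levels of the target subject.
protein_correlations = [[1.0, 0.5], [0.5, 1.0]]
target_expression = [2.0, 4.0]
conditioned_expression = matvec(protein_correlations, target_expression)

# Hypothetical change-in-connectivity matrix applied to the N^2 entries
# of the target subject's functional connectivity matrix.
connectivity_change = [[0.0, 0.25], [0.25, 0.0]]
target_connectivity = [[1.0, 0.5], [0.5, 1.0]]
added = pointwise(connectivity_change, target_connectivity, operator.add)
multiplied = pointwise(connectivity_change, target_connectivity, operator.mul)
matrix_product = matmul(connectivity_change, target_connectivity)
```

Matrix multiplication by an N × 1 vector mixes each feature with the features it is correlated with, whereas the pointwise operations adjust each of the N² entries independently.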

The description above references generating conditioning data 1608 derived from feature representations of the reference subjects 1602 corresponding to a particular reference modality. However, it can be appreciated that the conditioning system 1600 can obtain feature representations of the reference subjects 1602 corresponding to each of multiple reference modalities, and generate respective conditioning data 1608 based on each reference modality. The conditioning system 1600 can then condition multi-modal data 1614 characterizing a target subject 1612 using the respective conditioning data 1608 derived from each of the multiple reference modalities.

FIG. 17 illustrates an example of the operations performed by the conditioning system 1600 described with reference to FIG. 16. The conditioning system 1600 obtains a respective feature representation, corresponding to a reference feature modality, for each reference subject 1602 in a population of reference subjects 1602. The conditioning system 1600 processes the feature representations 1604 of the reference subjects 1602 to generate conditioning data 1608. The conditioning system 1600 applies the conditioning data 1608 to one or more feature dimensions 1704 of a set of multi-modal data 1614 characterizing a target subject 1612 by way of a conditioning operation 1702. The conditioning operation 1702 can be, e.g., a pointwise addition, pointwise multiplication, or matrix multiplication operation. The conditioned multi-modal data 1616 can then be provided as an input to a machine learning model 1618, e.g., the encoder neural network of the machine learning system described with reference to FIG. 1.

FIG. 18 shows an example response estimation system 1800. The response estimation system 1800 is an example of a system implemented as computer programs on one or more computers in one or more locations in which the systems, components, and techniques described below are implemented.

The response estimation system 1800 is configured to generate a respective response score 1818, relative to a drug 1802, for each patient category in a set of patient categories 1816. The response score 1818 for a patient category 1816 can characterize a predicted response of patients included in the patient category to receiving the drug 1802.

The drug 1802 can be any appropriate substance that can be introduced into the body of a patient to achieve a desired physiological effect, e.g., the effect of treating one or more medical conditions in the patient, e.g., a psychiatric condition (e.g., depression, psychosis, or schizophrenia), cancer, diabetes, etc.

The “response” of a patient to receiving a drug can refer to any of a variety of changes in the condition of the patient as a result of receiving the drug. For example, the response of a patient to receiving a drug can characterize an amount of improvement in one or more symptoms of a medical condition in the patient that are achieved by administering the drug to the patient. As another example, the response of a patient to receiving a drug can characterize a level of side effects induced in the patient by the drug.

A response score 1818 for a patient category 1816 can (implicitly or explicitly) characterize any appropriate aspect of a predicted response of patients included in the patient category 1816 to receiving the drug. A few example applications of response scores 1818 generated by the response estimation system 1800 are described in more detail below.

A patient category 1816 can refer to a classification of patients, e.g., such that any patient can be classified as being included in a respective patient category from the set of patient categories 1816. For instance, a patient can be classified as being included in a patient category based on a set of multi-modal data characterizing the patient. Patients included in the same patient category may share similar features and characteristics. An example process for generating patient categories is described in detail with reference to FIG. 5, and an example process for classifying patients into patient categories is described in more detail with reference to FIG. 7A.

The response estimation system 1800 generates the response scores 1818 using a signature engine 1808, an encoder neural network 104, and a response engine 1814, which are each described in more detail next.

The signature engine 1808 is configured to receive, for each entity 1804 in a population of entities: (i) a respective “pre-treatment” feature representation 1806 of the entity 1804, and (ii) a respective “post-treatment” feature representation 1806 of the entity 1804.

Each entity 1804 can be, e.g., a cell, a collection of cells, or a patient. The population of entities can include any appropriate number of entities, e.g., one entity, 1,000 entities, or 10,000 entities. In some cases, the population of entities can include cells corresponding to different tissue types, e.g., liver tissue, kidney tissue, brain tissue, etc.

Generally, a feature representation 1806 of an entity (e.g., a pre-treatment feature representation or a post-treatment feature representation) refers to a collection of features characterizing the entity that collectively represent data from one or more modalities. A feature representation 1806 of an entity 1804 can represent data from any appropriate modalities, e.g., an fMRI modality, a PET modality, a genomic data modality, a proteomic data modality, etc.

A feature representation 1806 of an entity can be represented, e.g., as an ordered collection of numerical values, e.g., a vector, matrix, or other tensor of numerical values. For convenience, the features in a feature representation 1806 for an entity can be understood as being indexed by a set of feature dimensions, where each feature dimension of the feature representation 1806 is associated with a value of a respective feature of the entity.

A pre-treatment feature representation 1806 of an entity refers to a representation of the entity that was captured (e.g., by one or more sensors) prior to the drug 1802 being administered to the entity. Conversely, a post-treatment feature representation 1806 of an entity refers to a representation of the entity that was captured (e.g., by one or more sensors) after the drug was applied to the entity (e.g., one minute, one hour, or one day after the drug was applied to the entity).

In some cases, for each entity 1804, the pre-treatment feature representation 1806 of the entity includes the same features as the post-treatment feature representation 1806 of the entity. However, the value of any given feature may differ between the pre-treatment feature representation 1806 and the post-treatment feature representation 1806 of an entity, e.g., at least in part because of the impact of the drug 1802 on the entity.

The drug 1802 can be administered to entities 1804 in the population of entities 1804 in any appropriate manner. In some cases, the entities are patients; a drug can be administered to a patient, e.g., by an injection, rectally, orally, or topically. In some cases, the entities are cells or collections of cells; a drug can be administered to a cell or a collection of cells, e.g., by introducing the drug into the environment of the cells, e.g., in vivo (e.g., in a biological environment) or in vitro (e.g., in an artificial environment, e.g., in a test tube or petri dish).

The signature engine 1808 is configured to process the pre-treatment and post-treatment feature representations 1806 of the entities 1804 in the population of entities to generate a drug signature 1810. The drug signature 1810 includes a respective impact score for each feature included in the feature representations 1806. The impact score for a feature characterizes an impact, caused by administering the drug 1802 to the entities 1804, on the value of the feature measured for the entities 1804. For instance, the impact score for a first feature being higher than the impact score for a second feature can indicate that administering the drug has a higher impact on the value of the first feature than on the value of the second feature.

The drug signature 1810 can be represented as an ordered collection of numerical values (in particular, impact scores), e.g., as a vector, matrix, or other tensor of numerical values. The impact scores in the drug signature can be indexed by the same set of feature dimensions as the feature representations 1806; in particular, the impact score indexed by a feature dimension represents the impact score for the feature corresponding to the feature dimension.

The signature engine 1808 can process the pre-treatment and post-treatment feature representations 1806 to generate the drug signature 1810 in any of a variety of ways. For instance, to generate the drug signature, the signature engine 1808 can generate a respective “differential” feature representation for each entity 1804, e.g., as a difference between the pre-treatment feature representation 1806 of the entity 1804 and the post-treatment feature representation 1806 of the entity 1804. That is, the signature engine 1808 can generate a differential feature representation for an entity by element-wise subtracting the post-treatment feature representation for the entity from the pre-treatment representation for the entity (or vice versa).

The signature engine 1808 can generate a respective entity-specific drug signature for each entity 1804 based on the differential feature representation 1806 of the entity. A few example techniques for generating an entity-specific drug signature for an entity based on the differential feature representation for the entity are described next.

In some implementations, the signature engine 1808 can generate an entity-specific drug signature for an entity by element-wise dividing the differential feature representation for the entity by the pre-treatment feature representation for the entity. Thus, in this example, the impact score for each feature represents a fractional change in the value of the feature between the pre-treatment feature representation and the post-treatment feature representation.

In some implementations, the signature engine 1808 can generate an entity-specific drug signature for an entity by applying a non-linear activation function to the differential feature representation for the entity. The non-linear activation function can be, e.g., a sigmoid activation function, a soft-max activation function, or any other appropriate activation function.

In some implementations, the signature engine 1808 can define an entity-specific drug signature for an entity as being the differential feature representation of the entity.

The signature engine 1808 can combine the entity-specific drug signatures for the entities 1804 in the population of entities 1804 to generate the overall drug signature 1810. For example, for each feature represented in the feature representations 1806, the signature engine 1808 can generate the impact score for the feature in the overall drug signature 1810 as a measure of central tendency of the impact scores for the feature in the entity-specific drug signatures. The measure of central tendency can be, e.g., a mean, a median, or a mode. In some implementations, where the population of entities includes only a single entity, the signature engine 1808 can designate the entity-specific drug signature for the entity as being the overall drug signature 1810.
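As a non-limiting illustration, generating entity-specific drug signatures and combining them into an overall drug signature can be sketched as follows. The sketch computes the differential feature representation as post-treatment minus pre-treatment (the opposite sign convention is equally possible), assumes non-zero pre-treatment values for the fractional variant, and uses a mean as the default measure of central tendency. The function names, the `kind` and `combine` parameters, and the numerical values are hypothetical.

```python
import math
from statistics import mean


def entity_signature(pre, post, kind="fractional"):
    """Compute an entity-specific drug signature from one entity's
    pre-treatment and post-treatment feature representations."""
    diff = [after - before for before, after in zip(pre, post)]
    if kind == "difference":
        return diff
    if kind == "fractional":
        # Assumption: every pre-treatment value is non-zero.
        return [d / before for d, before in zip(diff, pre)]
    if kind == "sigmoid":
        return [1.0 / (1.0 + math.exp(-d)) for d in diff]
    raise ValueError(f"unknown kind: {kind}")


def drug_signature(pre_list, post_list, kind="fractional", combine=mean):
    """Combine entity-specific signatures into an overall drug signature by
    applying `combine` to each feature dimension across entities."""
    signatures = [entity_signature(pre, post, kind)
                  for pre, post in zip(pre_list, post_list)]
    return [combine(scores) for scores in zip(*signatures)]


# Hypothetical feature values for two entities and two features.
pre_treatment = [[10.0, 4.0], [20.0, 8.0]]
post_treatment = [[15.0, 4.0], [25.0, 6.0]]

overall_signature = drug_signature(pre_treatment, post_treatment)
```

With a population containing a single entity, the combining step returns the entity-specific signature unchanged, which matches the single-entity case described above.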

The impact factors in the drug signature 1810 can correspond to features measured using any appropriate modalities. A few examples of possible modalities associated with impact factors in the drug signature 1810 are described next.

In some implementations, one or more of the impact factors in the drug signature 1810 are each associated with a feature measuring an expression level of a respective gene. An impact factor associated with a feature measuring an expression level of a gene characterizes an impact, caused by administering the drug to the entities 1804, on the expression level of the gene in the entities 1804.

In some implementations, one or more of the impact factors in the drug signature 1810 are each associated with a feature measuring an expression level of a respective protein. An impact factor associated with a feature measuring an expression level of a protein characterizes an impact, caused by administering the drug to the entities 1804, on the expression level of the protein in the entities 1804.

In some implementations, one or more of the impact factors in the drug signature 1810 are each associated with a feature measuring an amount of a radiotracer (e.g., a radioactive substance tagged to a drug) in a respective brain region. An impact factor associated with a feature measuring an amount of a radiotracer in a brain region characterizes an impact, caused by administering the drug to the entities 1804, on the amount of the radiotracer in the brain region of the entities 1804. In these implementations, the drug administered to the entities can be tagged with the radiotracer. The amount of the radiotracer in a brain region in the brain of a patient can be measured, e.g., from PET imaging of the brain of the patient. For example, the amount of a radiotracer in a brain region can be determined as a combination (e.g., an average or sum) of the intensity values of voxels included in the brain region in a PET image of the brain.

In some implementations, one or more of the impact factors in the drug signature 1810 are each associated with a feature measuring an amount of blood flow in a respective brain region. An impact factor associated with a feature measuring an amount of blood flow in a brain region characterizes an impact, caused by administering the drug to the entities 1804, on the amount of blood flow in the brain region in the entities 1804. In these implementations, the amount of blood flow in a brain region of the brain of a patient can be measured, e.g., from fMRI imaging of the brain of the patient. For example, the amount of blood flow in a brain region can be determined as a combination (e.g., an average or sum) of the intensity values of voxels included in the brain region in an fMRI image of the brain.

The encoder neural network 104 is a neural network that has been trained to process multi-modal data characterizing a patient to generate an embedding of the multi-modal data in a latent space. An example architecture of the encoder neural network is described above with reference to FIG. 2. Prior to being used by the archetype generation system 300B, the encoder neural network 104 can be jointly trained, along with a decoder neural network 108, e.g., by the training system 900 described with reference to FIG. 9.

The response estimation system 1800 generates a network input to the encoder neural network 104 based on the drug signature 1810. More specifically, the network input to the encoder neural network 104 includes an ordered collection of features that is indexed by a set of feature dimensions, i.e., such that each feature in the network input is indexed by a unique feature dimension in the set of feature dimensions. To generate the network input to the encoder neural network 104, the response estimation system 1800 associates each impact score in the drug signature with a respective feature dimension of the network input. For each feature dimension of the network input that is associated with a respective impact score in the drug signature 1810, the response estimation system 1800 defines the value of the feature dimension of the network input as being the value of the corresponding impact score from the drug signature 1810.

The response estimation system 1800 can associate the impact scores in the drug signature 1810 with corresponding feature dimensions of the network input to the encoder neural network 104 in accordance with predefined rules. For instance, in some cases, an impact score in the drug signature 1810 can correspond to a feature that is included in the network input. In these cases, the response estimation system 1800 can associate the impact score with the feature dimension of the corresponding feature in the network input.

In some cases, one or more of the impact scores in the drug signature 1810 can correspond to features from modalities that are not included in the network input to the encoder neural network 104. In these cases, the response estimation system 1800 can perform a cross-modal assignment of impact scores from the drug signature to corresponding feature dimensions of the network input. In particular, the response estimation system 1800 can assign an impact score for a feature corresponding to one modality in the drug signature 1810 to a feature dimension corresponding to a different modality in the network input to the encoder neural network 104.

For example, the drug signature 1810 may include an impact score for a PET feature measuring an amount of radiotracer in a brain region, while the network input to the encoder neural network 104 may not include any PET features. In this example, the response estimation system 1800 may associate the impact score for the PET feature of the brain region in the drug signature 1810 with a feature dimension of the network input that corresponds to an fMRI feature measuring blood flow in the brain region. Thus the response estimation system 1800 can perform a cross-modal assignment of an impact score corresponding to a PET feature to a feature dimension corresponding to an fMRI feature in the network input to the encoder neural network 104.

In some implementations, the number of impact scores in the drug signature 1810 may be less than the number of feature dimensions of the network input to the encoder neural network 104. Put another way, certain feature dimensions of the network input to the encoder neural network 104 may be “undefined,” i.e., as a result of not being associated with corresponding impact scores from the drug signature 1810.

The response estimation system 1800 can address the issue of undefined feature dimensions in the network input to the encoder neural network 104 in a variety of possible ways. For example, the response estimation system 1800 can set the undefined feature dimensions of the network input to default values, e.g., zero. As another example, the response estimation system can set the value of each undefined feature dimension of the network input to a measure of central tendency (e.g., a mean, median, or mode) of feature values corresponding to the feature dimension for entities (e.g., cells or subjects) in a population of entities. As another example, the architecture of the encoder neural network may explicitly account for the possibility that the values of one or more feature dimensions are undefined. For instance, the encoder neural network architecture described with reference to FIG. 2 includes a respective subnetwork corresponding to each of multiple modalities. If the features corresponding to a particular modality are undefined in the network input, the encoder neural network disables the operations of the subnetwork corresponding to the modality, as described above with reference to FIG. 2.
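As an illustrative sketch only, the assignment of impact scores to feature dimensions, including a cross-modal assignment and the filling of undefined dimensions, can be written as follows. The function `build_network_input`, the feature labels, and the dictionary-based rule table are hypothetical and are not taken from the specification.

```python
def build_network_input(signature, feature_dims, assignment=None, defaults=None):
    """Map impact scores onto an ordered list of network-input feature dimensions.

    signature:    dict, signature feature label -> impact score
    feature_dims: ordered list of network-input feature labels
    assignment:   dict of predefined rules, signature label -> input label
                  (used here for cross-modal assignment)
    defaults:     dict, input label -> fallback value for undefined dimensions
                  (e.g., a population mean); dimensions without one get 0.0
    """
    assignment = assignment or {}
    defaults = defaults or {}
    values = {}
    for label, score in signature.items():
        target = assignment.get(label, label)  # same-feature rule by default
        if target in feature_dims:
            values[target] = score
    return [values.get(dim, defaults.get(dim, 0.0)) for dim in feature_dims]


# Hypothetical labels: a PET score is routed to the fMRI dimension of the same region.
signature = {"gene:G1": 0.4, "pet:region_1": -0.2}
feature_dims = ["gene:G1", "gene:G2", "fmri:region_1"]
assignment = {"pet:region_1": "fmri:region_1"}
defaults = {"gene:G2": 1.5}
network_input = build_network_input(signature, feature_dims, assignment, defaults)
```

The third option described above, disabling a modality-specific subnetwork, is an architectural choice and is not covered by this sketch.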

The encoder neural network 104 processes the network input based on the drug signature 1810, in accordance with values of the encoder neural network parameters, to generate an embedding 1812 of the drug signature in a latent space.

The response engine 1814 is configured to process: (i) the drug signature embedding 1812, and (ii) data defining the patient categories 1816, to generate a respective response score 1818 for each patient category 1816. The response score 1818 for a patient category 1816 characterizes a predicted response of patients included in the patient category to the drug 1802. (That is, the response score for a patient category can, in some cases, implicitly or explicitly encode information relevant to the response of patients in the patient category to the drug).

Each patient category 1816 can be represented by a cluster of patient embeddings in the latent space, where each patient embedding corresponds to a respective patient and is generated by processing multi-modal data characterizing the patient using the encoder neural network 104. An example process for generating patient categories is described in more detail with reference to FIG. 5.

The response engine 1814 can generate the response score 1818 for a patient category 1816 in any of a variety of possible ways. A few example techniques for generating a response score 1818 for a patient category 1816 are described next.

In some implementations, the response engine 1814 defines the response score 1818 for a patient category 1816 as being a result of evaluating a similarity measure between: (i) the drug signature embedding 1812, and (ii) an embedding representing the patient category 1816. The embedding representing the patient category 1816 can be, e.g., a centroid of the patient embeddings in the patient category 1816, an average of the patient embeddings in the patient category 1816, or an archetype embedding representing the patient category 1816 (as described above with reference to FIG. 3B). The similarity measure can be, e.g., a cosine similarity measure, an L₁ similarity measure, an L₂ similarity measure, or any other appropriate similarity measure.

In some implementations, to generate the response score 1818 for a patient category 1816, the response engine 1814 evaluates a respective similarity measure between: (i) the drug signature embedding 1812, and (ii) each patient embedding included in the patient category 1816. The response engine 1814 can then define the response score 1818 as being a measure of central tendency of the similarity measures between the drug signature embedding 1812 and the patient embeddings included in the patient category 1816. The measure of central tendency can be, e.g., a mean, a median, or a mode. The similarity measure can be, e.g., a cosine similarity measure, an L₁ similarity measure, an L₂ similarity measure, or any other appropriate similarity measure.
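The two embedding-based techniques above can be sketched as follows, using cosine similarity as the similarity measure and the mean as the measure of central tendency. The function names and the toy two-dimensional embeddings are hypothetical; any other appropriate similarity measure could be substituted.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def centroid(embeddings):
    """Coordinate-wise mean of a cluster of embeddings."""
    return [mean(coords) for coords in zip(*embeddings)]


def score_by_centroid(drug_embedding, patient_embeddings):
    """Similarity between the drug embedding and the category centroid."""
    return cosine(drug_embedding, centroid(patient_embeddings))


def score_by_members(drug_embedding, patient_embeddings):
    """Mean of the similarities to each patient embedding in the category."""
    return mean([cosine(drug_embedding, e) for e in patient_embeddings])


drug = [1.0, 0.0]
category_a = [[1.0, 0.0], [0.0, 1.0]]
category_b = [[-1.0, 0.0], [-2.0, 0.0]]
scores = {
    "A": score_by_centroid(drug, category_a),
    "B": score_by_centroid(drug, category_b),
}
```

The two techniques generally give different values for the same category: for `category_a` the centroid-based score is about 0.707, while the member-based score is 0.5.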

In some cases, the response estimation system 1800 generates the response scores 1818 for the patient categories 1816 using the drug signature 1810, without embedding the drug signature 1810 in the latent space. A few example techniques for generating a response score 1818 for a patient category 1816 without embedding the drug signature 1810 in the latent space are described next.

In some implementations, the response engine 1814 defines the response score 1818 for a patient category 1816 as being a result of evaluating a similarity measure between: (i) the drug signature 1810, and (ii) a set of multi-modal data representing the patient category 1816. The set of multi-modal data representing the patient category 1816 can be, e.g., a centroid of multi-modal data tensors for patients included in the patient category 1816, an average of multi-modal data tensors for patients included in the patient category 1816, or a multi-modal data tensor corresponding to an archetype embedding representing the patient category 1816 (as described above with reference to FIG. 3B). The similarity measure can be, e.g., a cosine similarity measure, an L₁ similarity measure, an L₂ similarity measure, or any other appropriate similarity measure.

In some implementations, to generate the response score 1818 for a patient category 1816, the response engine 1814 evaluates a respective similarity measure between: (i) the drug signature 1810, and (ii) a respective multi-modal data tensor for each patient included in the patient category 1816. The response engine 1814 can then define the response score 1818 as being a measure of central tendency of the similarity measures between the drug signature 1810 and the multi-modal data tensors for the patients included in the patient category 1816. The measure of central tendency can be, e.g., a mean, a median, or a mode. The similarity measure can be, e.g., a cosine similarity measure, an L₁ similarity measure, an L₂ similarity measure, or any other appropriate similarity measure.

The response estimation system 1800 can use the response scores 1818 for the patient categories 1816 in any of a variety of possible applications. A few examples of possible applications of the response scores 1818 are described next.

In some implementations, the response estimation system 1800 uses the response scores 1818 to generate a ranking of the patient categories 1816. For instance, the response estimation system 1800 can generate a ranking of the patient categories 1816 from highest response score to lowest response score, or from lowest response score to highest response score.

In some implementations, the response estimation system 1800 uses the response scores 1818 to define a treatment criterion for selecting a course of medical treatment for a patient. For example, the treatment criterion may be that a patient is included in a patient category with a corresponding response score 1818 that satisfies (e.g., exceeds, or does not exceed) a threshold. The response estimation system 1800 can generate a recommendation that a patient should receive the drug 1802 based at least in part on whether the treatment criterion is satisfied for the patient.
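For illustration, the ranking and the threshold-based treatment criterion can be sketched as follows. The category labels, the scores, the threshold value, and the choice of "exceeds" as the satisfaction condition are all assumptions made for this example.

```python
def rank_categories(response_scores, descending=True):
    """Order patient-category labels by response score."""
    return sorted(response_scores, key=response_scores.get, reverse=descending)


def treatment_criterion_satisfied(patient_category, response_scores, threshold):
    """Assumed criterion: the score of the patient's category exceeds the threshold."""
    return response_scores[patient_category] > threshold


response_scores = {"A": 0.71, "B": -1.0, "C": 0.2}  # made-up values
ranking = rank_categories(response_scores)
recommend_for_a = treatment_criterion_satisfied("A", response_scores, threshold=0.5)
recommend_for_c = treatment_criterion_satisfied("C", response_scores, threshold=0.5)
```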

Optionally, as an alternative to or in combination with generating response scores 1818 for patient categories 1816, the response estimation system 1800 can generate response scores 1818 for individual patients (“patient-specific” response scores). The response estimation system 1800 can define a patient-specific response score for a patient, e.g., as a measure of similarity between: (i) the drug signature embedding 1812, and (ii) a patient embedding for the patient. (The response estimation system 1800 can generate the patient embedding for the patient, e.g., by processing multi-modal data characterizing the patient using the encoder neural network 104, as described above). A patient-specific response score for a patient can characterize a predicted response of the patient to the drug 1802. A patient-specific response score for a patient can be used to define a treatment criterion for selecting a course of medical treatment for the patient, e.g., as described above with reference to response scores 1818 for patient categories 1816.

The encoder neural network 104 can be jointly trained along with a decoder neural network by a training system, e.g., the training system described with reference to FIG. 9. The training system can incorporate the drug signature embedding 1812 into the training of the encoder and decoder neural networks, as will be described next.

As described with reference to FIG. 9, the training system can jointly train the encoder neural network and the decoder neural network to optimize an objective function that includes a reconstruction loss that measures errors in reconstructed multi-modal data generated by the decoder neural network. The reconstruction loss can include multiple scaling factors that each scale a respective term in the reconstruction loss that measures an error in a corresponding proper subset of the feature dimensions of the reconstructed multi-modal data generated by the decoder neural network. Thus each scaling factor controls the relative importance of the error in a corresponding proper subset of the feature dimensions of the reconstructed multi-modal data to the calculation of the overall error in the reconstructed multi-modal data. An example of a reconstruction loss is provided in equation (11), where the $\{\beta_i\}_{i=1}^{n}$ represent the scaling factors.

The value of each scaling factor in the reconstruction loss can be set based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data to a particular medical condition, e.g., a medical condition that is treated by the drug 1802. The training system can use the drug signature embedding 1812 to dynamically adjust one or more of the scaling factors in the reconstruction loss function during training. For instance, the training system can uniformly increase the values of the scaling factors, e.g., to increase the relative importance of the reconstruction loss relative to other parts of the objective function, e.g., the archetype loss, the clustering loss, or the prior loss. As another example, the training system can increase the values of certain designated scaling factors relative to other scaling factors, e.g., to increase the relative importance of the errors in certain subsets of the reconstructed multi-modal data to the calculation of the overall error in the reconstructed multi-modal data.

For example, at each of one or more training iterations, the training system can use the drug signature embedding 1812 to generate an influence score, e.g., that characterizes an influence of the scaling factors in the reconstruction loss on the semantic structure of the latent space. The training system can then increase one or more of the scaling factors in the reconstruction loss, over a sequence of training iterations, until the influence score satisfies (e.g., exceeds) a threshold.

The training system can generate the influence score at a training iteration in any of a variety of ways. For instance, to generate the influence score, the training system can generate a current drug signature embedding, i.e., in accordance with the current values of the encoder neural network parameters at the training iteration. The training system can then determine a respective similarity measure between the drug signature embedding and each of multiple “reference” embeddings in the latent space. The training system can then define the influence score, e.g., as the maximum of the similarity measures between the drug signature embedding and the reference embeddings.

The reference embeddings can be any appropriate embeddings in the latent space. For instance, the reference embeddings can include a respective embedding representing each patient category in the latent space as of the training iteration. An embedding representing a patient category can be, e.g., a centroid of the patient embeddings in the patient category, an average of the patient embeddings in the patient category, or an archetype embedding representing the patient category (as described above with reference to FIG. 3B).
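A minimal sketch of the influence score and of one scaling-factor adjustment step is given below. Cosine similarity, the multiplicative `step`, and the function names are assumptions made for illustration; the specification does not prescribe a particular update rule.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def influence_score(drug_embedding, reference_embeddings):
    """Maximum similarity between the drug embedding and the reference embeddings."""
    return max(cosine(drug_embedding, r) for r in reference_embeddings)


def adjust_scaling_factors(betas, score, threshold, step=1.1, designated=None):
    """Increase scaling factors while the influence score does not exceed the threshold.

    designated: indices of the factors to increase; None increases all uniformly.
    """
    if score > threshold:
        return list(betas)  # criterion satisfied, leave the factors unchanged
    targets = set(range(len(betas)) if designated is None else designated)
    return [b * step if i in targets else b for i, b in enumerate(betas)]


# Reference embeddings could be, e.g., one centroid per patient category.
references = [[0.0, 1.0], [1.0, 1.0]]
score = influence_score([1.0, 0.0], references)
betas = adjust_scaling_factors([1.0, 2.0], score, threshold=0.9, step=2.0, designated=[1])
```

In a training loop, the drug signature embedding and the reference embeddings would be recomputed with the current encoder parameters before each call.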

Intuitively, as the scaling factors for the reconstruction loss increase during training, the semantic structure of the latent space will adapt to increasingly emphasize information relevant to the medical condition treated by the drug 1802. As the latent space increasingly adapts to emphasize information relevant to the medical condition treated by the drug 1802, the reference embeddings in the latent space may increasingly reorient toward the drug signature embedding 1812 (which itself encodes information relevant to the medical condition, in particular, the relative impact of a drug treating the medical condition on respective patient features). The similarity between the drug signature embedding and the reference embeddings provides a measure of the influence of the scaled reconstruction loss on the semantic structure of the latent space.

FIG. 19 illustrates an example of computing a drug signature 1910 based on gene expression in a cell 1902. The pre-treatment feature representation of the cell 1902 includes the pre-treatment gene expression data 1906, e.g., that measures a respective level of expression of each of multiple genes in the cell prior to the drug 1904 being applied to the cell 1902. The post-treatment feature representation of the cell 1902 includes the post-treatment gene expression data 1908, e.g., that measures a respective level of expression of multiple genes in the cell after the drug 1904 has been applied to the cell 1902. The drug signature 1910 for the drug 1904 can be based at least in part on a difference between the pre-treatment gene expression levels and the post-treatment gene expression levels, as described above in more detail with reference to FIG. 18.

FIG. 20 illustrates examples of response scores for patient categories. In particular, FIG. 20 illustrates two patient categories 2002-A-B. Each patient category is represented by a respective cluster of patient embeddings in a latent space. Each patient embedding in the cluster of patient embeddings representing patient category 2002-A is represented by a light colored circle (e.g., 2010), and each patient embedding in the cluster of patient embeddings representing patient category 2002-B is represented by a dark colored circle. Each patient embedding corresponds to a respective patient and is generated by processing multi-modal data characterizing the patient using an encoder neural network. The drug signature embedding 2006 is an embedding of a drug signature for a drug 2008. The drug signature embedding 2006 is generated by processing a network input based on the drug signature using the encoder neural network. The drug signature embedding 2006 can be used to generate respective response scores 2004-A-B for the patient categories 2002-A-B. A response score for a patient category can characterize a predicted response of patients included in the patient category to the drug, as described in more detail above with reference to FIG. 18.

FIG. 21 is a flow diagram of an example process 2100 for classifying a patient as being included in a patient category. For convenience, the process 2100 will be described as being performed by a system of one or more computers located in one or more locations. For example, a patient classification system, e.g., the patient classification system 700 of FIG. 7A, appropriately programmed in accordance with this specification, can perform the process 2100.

The system receives multi-modal data characterizing a patient (2102). The multi-modal data includes a respective feature representation for each of multiple modalities.

The system processes the multi-modal data characterizing the patient using an encoder neural network to generate an embedding of the multi-modal data characterizing the patient (2104). For example, the system can process the respective feature representation for each modality using a corresponding encoder subnetwork of the encoder neural network to generate a respective encoder subnetwork output. The system can then combine the encoder subnetwork outputs to generate the embedding of the multi-modal data characterizing the patient.

The system determines a respective classification score for each patient category in a set of patient categories based on the embedding of the multi-modal data characterizing the patient (2106). The set of patient categories can be determined by the patient clustering system, e.g., as described with reference to FIG. 5.

The system classifies the patient as being included in a corresponding patient category from the set of patient categories based on the classification scores (2108). For example, the system can classify the patient as being included in the patient category with the highest classification score.
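Steps 2106 and 2108 can be sketched as follows. This portion of the specification does not fix how classification scores are computed from the embedding, so the sketch assumes, purely for illustration, a soft-max over negative Euclidean distances to a centroid per patient category. The function names and centroids are hypothetical.

```python
import math


def classification_scores(embedding, category_centroids):
    """Assumed scoring rule: soft-max over negative distances to category centroids."""
    logits = {
        name: -math.dist(embedding, center)
        for name, center in category_centroids.items()
    }
    peak = max(logits.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


def classify(embedding, category_centroids):
    """Pick the patient category with the highest classification score."""
    scores = classification_scores(embedding, category_centroids)
    return max(scores, key=scores.get)


centroids = {"A": [0.0, 0.0], "B": [3.0, 4.0]}
patient_embedding = [0.5, 0.0]
scores = classification_scores(patient_embedding, centroids)
category = classify(patient_embedding, centroids)
```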

FIG. 22 is a flow diagram of an example process 2200 for generating a multi-modal data archetype and a corresponding archetype representation for each dimension of a latent space. For convenience, the process 2200 will be described as being performed by a system of one or more computers located in one or more locations. For example, an archetype generation system, e.g., the archetype generation systems 300A-B of FIGS. 3, appropriately programmed in accordance with this specification, can perform the process 2200.

The system obtains a set of training examples (2202). Each training example corresponds to a respective patient and includes multi-modal data, having a set of feature dimensions, that characterizes the patient.

The system jointly trains an encoder neural network and a decoder neural network on the set of training examples (2204). The encoder neural network is configured to process input multi-modal data characterizing an input patient to generate an embedding of the input multi-modal data in a multi-dimensional latent space. The decoder neural network is configured to process the embedding of the input multi-modal data to generate a reconstruction of the input multi-modal data. An example process for training an encoder neural network and a decoder neural network on a set of training examples is described with reference to FIG. 23.

The system generates a set of multi-modal data archetypes (2206).

In some implementations, each multi-modal data archetype corresponds to a respective dimension of the latent space. In particular, for each dimension of the latent space, the system processes a predefined embedding that represents the dimension of the latent space using the decoder neural network to generate multi-modal data that defines the multi-modal data archetype for the dimension of the latent space.

In some implementations, the system processes the multi-modal training data from each training example using the encoder neural network to generate a set of multi-modal data embeddings in the latent space. The system processes the set of embeddings to generate a set of region parameters that define a region enclosing the set of embeddings in the latent space, e.g., the region can be a convex hull of the set of embeddings. The system then generates the set of multi-modal data archetypes based on: (i) the set of embeddings, and (ii) the region enclosing the set of embeddings in the latent space.

The system generates a respective archetype representation of each multi-modal data archetype (2208). To generate an archetype representation of a multi-modal data archetype, the system generates a respective intensity score for each feature dimension of the multi-modal data archetype based on: (i) a value of the feature dimension of the multi-modal data archetype, and (ii) a distribution defined by values of the feature dimension of multi-modal data included in the set of training examples. The archetype representation of the multi-modal data archetype includes the respective intensity score for each of the plurality of feature dimensions of the multi-modal data archetype.
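One possible way to realize step 2208 is sketched below. The passage above states only that the intensity score depends on the archetype value and on the distribution of training values for the same feature dimension. The z-score used here is an assumption chosen for illustration, and `intensity_scores` is a hypothetical name.

```python
from statistics import mean, pstdev


def intensity_scores(archetype, training_data):
    """Assumed intensity score: z-score of each archetype value against training values.

    archetype:     list of feature values, one per feature dimension
    training_data: list of training examples, each a list of feature values
    """
    scores = []
    for dim, value in enumerate(archetype):
        column = [example[dim] for example in training_data]
        mu, sigma = mean(column), pstdev(column)
        # A constant feature has no spread; report a neutral score of 0.0.
        scores.append((value - mu) / sigma if sigma > 0 else 0.0)
    return scores


training_data = [[1.0, 10.0], [3.0, 10.0]]  # made-up values
archetype = [4.0, 12.0]
representation = intensity_scores(archetype, training_data)
```

A percentile rank within the training distribution would be another choice consistent with the same description.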

FIG. 23 is a flow diagram of an example process 2300 for jointly training an encoder neural network and a decoder neural network. For convenience, the process 2300 will be described as being performed by a system of one or more computers located in one or more locations. For example, a training system, e.g., the training system 900 of FIG. 9, appropriately programmed in accordance with this specification, can perform the process 2300.

The system receives a set of training examples (2302). For example, the system can sample the training examples from a set of training data that includes multiple training examples. Each training example corresponds to a respective patient and includes multi-modal data, having a set of feature dimensions, that characterizes the patient.

For each training example, the system processes the multi-modal data from the training example using the encoder neural network, in accordance with current values of the encoder parameters, to generate an embedding of the multi-modal data from the training example (2304).

For each training example, the system processes the embedding of the multi-modal data from the training example using the decoder neural network, in accordance with current values of the decoder parameters, to generate a reconstruction of the multi-modal data from the training example (2306).

The system updates the current values of the set of encoder parameters and the current values of the set of decoder parameters using gradients of an objective function (2308). The objective function includes a reconstruction loss function, and optionally one or more of: an archetype loss function, a clustering loss function, or a prior loss function.

The reconstruction loss function measures, for each training example, an error in the reconstruction of the multi-modal data from the training example. In particular, the reconstruction loss function includes a set of scaling factors that each scale a respective term in the reconstruction loss function that measures an error in the reconstruction of a corresponding proper subset of the feature dimensions of the multi-modal data from the training example. Each of the scaling factors has a respective value that is based on a relevance of the corresponding proper subset of the feature dimensions of the multi-modal data from the training example to a particular medical condition. The system updates the current values of the set of encoder parameters and the current values of the set of decoder parameters using gradients of the reconstruction loss function.
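As a non-limiting sketch, a reconstruction loss with one scaling factor per proper subset of feature dimensions can be written as follows. The squared-error form of each term is an assumption; the exact form of the reconstruction loss referenced elsewhere in this specification is not reproduced here. The name `scaled_reconstruction_loss` is hypothetical.

```python
def scaled_reconstruction_loss(x, x_hat, subsets, betas):
    """Sum over subsets of beta_i times the squared error on that subset.

    x, x_hat: original and reconstructed feature vectors for one training example
    subsets:  list of index lists, each a proper subset of the feature dimensions
    betas:    one scaling factor per subset
    """
    if len(subsets) != len(betas):
        raise ValueError("need exactly one scaling factor per subset")
    total = 0.0
    for indices, beta in zip(subsets, betas):
        total += beta * sum((x[i] - x_hat[i]) ** 2 for i in indices)
    return total


x = [1.0, 2.0, 3.0]
x_hat = [1.0, 1.0, 5.0]
subsets = [[0, 1], [2]]  # e.g., one subset per modality (illustrative)
loss = scaled_reconstruction_loss(x, x_hat, subsets, betas=[1.0, 0.5])
```

Raising the scaling factor of a subset makes errors on its feature dimensions count more toward the overall loss, and therefore toward the gradients used to update the encoder and decoder parameters.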

In implementations where the objective function includes the archetype loss function, one or more dimensions of the latent space are “anchored” dimensions that are associated with a respective target multi-modal data archetype. For each anchored dimension of the latent space, the system processes a predefined embedding that represents the anchored dimension using the decoder neural network to generate multi-modal data that defines a predicted multi-modal data archetype corresponding to the anchored dimension. For each anchored dimension of the latent space, the archetype loss function measures an error between: (i) the predicted multi-modal data archetype corresponding to the anchored dimension, and (ii) the target multi-modal data archetype corresponding to the anchored dimension. The system updates the current values of the set of decoder parameters using gradients of the archetype loss function.

An example process for evaluating the clustering loss function is described in more detail with reference to FIG. 24.

FIG. 24 is a flow diagram of an example process 2400 for determining a clustering loss during joint training of an encoder neural network and a decoder neural network. For convenience, the process 2400 will be described as being performed by a system of one or more computers located in one or more locations. For example, a cluster hardening system, e.g., the cluster hardening system 1200 of FIG. 12, appropriately programmed in accordance with this specification, can perform the process 2400.

The system receives a respective embedding, generated by the encoderneural network, of the multi-modal data included in each trainingexample (2402).

The system clusters the embeddings into multiple clusters of embeddings(2404). The system can cluster the embeddings by applying an appropriateclustering operation to the embeddings, e.g., a k-means clusteringoperation. Each embedding is associated with: (i) a cluster label thatidentifies a cluster that includes the embedding, and optionally, (ii) aset of confounding features.

In some implementations, the system designates a proper subset of theembeddings as being training embeddings, and trains a classificationmachine learning model to process each training embedding to predict thecluster label of the training embedding (2406). The system can thendesignate a proper subset of the embeddings as validation embeddings,and evaluate a classification accuracy of the classification machinelearning model on a task of processing each validation embedding topredict the cluster label of the validation embedding (2408).

In some implementations, the system designates a proper subset of theembeddings as being training embeddings, and trains a classificationmachine learning model to process the set of confounding featurescorresponding to each training embedding to predict the cluster label ofthe training embedding (2410). The system can then designate a propersubset of the embeddings as validation embeddings, and evaluate aclassification accuracy of the classification machine learning model ona task of processing the set of confounding features corresponding toeach validation embedding to predict the cluster label of the validationembedding (2412).

The system determines a clustering loss based on the respective classification accuracy of each classification machine learning model (2414). For example, the system can determine the clustering loss as a linear combination of the classification accuracies of the classification machine learning models.
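By way of illustration only, the linear combination of step 2414 can be sketched in Python as follows. The helper names `accuracy` and `clustering_loss`, the example labels, and in particular the weights are hypothetical. This sketch assumes a negative weight on the accuracy of the embedding-based classifier (so that clusters that are easy to predict from embeddings lower the loss) and a positive weight on the accuracy of the confounder-based classifier (so that clusters that are predictable from confounding features raise the loss); this specification does not fix the signs or magnitudes of the coefficients.

```python
def accuracy(predicted_labels, true_labels):
    """Fraction of validation embeddings whose cluster label was predicted correctly."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

def clustering_loss(accuracies, weights):
    """Linear combination of classification accuracies."""
    if len(accuracies) != len(weights):
        raise ValueError("one weight is required per classification accuracy")
    return sum(w * a for w, a in zip(weights, accuracies))

# Hypothetical validation results for the two classifiers (steps 2408 and 2412).
true_labels = [0, 1, 1, 0]
embedding_accuracy = accuracy([0, 1, 1, 0], true_labels)   # all four correct
confounder_accuracy = accuracy([0, 0, 1, 1], true_labels)  # two of four correct

# Assumed weights: reward separable clusters, penalize confounder-driven clusters.
loss = clustering_loss([embedding_accuracy, confounder_accuracy], [-1.0, 1.0])
```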

FIG. 25 is a flow diagram of an example process 2500 for conditioning multi-modal data characterizing a target subject based on conditioning data derived from a population of reference subjects. For convenience, the process 2500 will be described as being performed by a system of one or more computers located in one or more locations. For example, a conditioning system, e.g., the conditioning system 1600 of FIG. 16, appropriately programmed in accordance with this specification, can perform the process 2500.

The system receives multi-modal data characterizing a target subject (2502). The multi-modal data characterizing the target subject includes a respective feature representation for each of a plurality of target modalities.

The system receives, for each reference subject in a population of reference subjects, a feature representation of the reference subject corresponding to a reference modality and having a plurality of feature dimensions (2504). In some cases, the system receives a pre-treatment feature representation of each reference subject captured before a medical treatment is applied to the reference subject, and a post-treatment feature representation of each reference subject captured after the medical treatment is applied to the reference subject.

The system generates the conditioning data based on the feature representations of the reference subjects (2506). For example, the system can determine, for each pair of feature dimensions including a first feature dimension and a second feature dimension, a respective correlation coefficient for the pair of feature dimensions that measures a correlation between: (i) a value of the first feature dimension in the feature representations of the reference subjects, and (ii) a value of the second feature dimension in the feature representations of the reference subjects.
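The pairwise correlation computation of step 2506 can be sketched, under stated assumptions, as follows. The sketch assumes that the Pearson correlation coefficient is used (this specification refers only to "a correlation coefficient"), that each reference subject is represented as a list of numeric feature values, and that a pair involving a constant feature dimension is assigned a coefficient of zero. The function names and the example data are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = math.sqrt(var_x * var_y)
    # Assumption: a constant feature dimension yields a coefficient of zero.
    return cov / denom if denom > 0 else 0.0

def correlation_matrix(reference_features):
    """Entry [i][j] correlates feature dimensions i and j across reference subjects."""
    columns = list(zip(*reference_features))  # one tuple per feature dimension
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical feature representations: three reference subjects, three dimensions.
reference_features = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
]
conditioning_data = correlation_matrix(reference_features)
```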

The system applies the conditioning data to the multi-modal data characterizing the target subject (2508). For example, the system can pointwise multiply each of multiple feature dimensions of the multi-modal data by corresponding dimensions of the conditioning data.
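A minimal sketch of the pointwise multiplication of step 2508 is given below. It assumes, for illustration only, that the conditioning data has already been reduced to one scalar weight per conditioned feature dimension, supplied as a mapping from feature index to weight; how such weights are derived from the conditioning data is not specified here, and the name `apply_conditioning` is hypothetical. Feature dimensions without a weight are left unchanged.

```python
def apply_conditioning(features, conditioning_weights):
    """Pointwise multiply selected feature dimensions by their conditioning weights."""
    # Dimensions absent from the mapping are multiplied by 1.0, i.e., left unchanged.
    return [
        value * conditioning_weights.get(index, 1.0)
        for index, value in enumerate(features)
    ]

# Hypothetical target-subject features and per-dimension conditioning weights.
target_features = [2.0, 4.0, 8.0]
conditioned = apply_conditioning(target_features, {0: 0.5, 2: 0.25})
```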

After applying the conditioning data to the multi-modal data characterizing the target subject, the system processes the multi-modal data characterizing the target subject using a machine learning model to generate a machine learning model output for the target subject (2510). For example, the system can process the multi-modal data characterizing the target subject using the encoder neural network described with reference to FIG. 1.

FIG. 26 is a flow diagram of an example process 2600 for generating a clinical recommendation for medical treatment of a patient. For convenience, the process 2600 will be described as being performed by a system of one or more computers located in one or more locations. For example, a recommendation system, e.g., the recommendation system 800 of FIG. 8, appropriately programmed in accordance with this specification, can perform the process 2600.

The system receives multi-modal data characterizing a patient (2602). The multi-modal data includes a respective feature representation for each modality in a set of modalities.

The system processes the multi-modal data characterizing the patient using a machine learning model, in accordance with values of a set of machine learning model parameters, to generate a patient classification that classifies the patient as being included in a patient category from a set of patient categories (2604). For example, the system can generate, by the machine learning model, a respective classification score for each patient category in the set of patient categories. The system can then classify the patient as being included in the patient category based on the classification scores.

The system determines an uncertainty measure that characterizes an uncertainty of the patient classification generated by the machine learning model (2606). For example, the system can process the classification scores for the patient categories to identify a trust set for the patient. The trust set specifies one or more patient categories that form a proper subset of the set of patient categories, where the patient is predicted to be included in a patient category within the trust set with at least a threshold probability. The system determines the uncertainty measure based on the trust set for the patient.
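One possible way to identify a trust set, sketched here under stated assumptions, is to add patient categories in order of decreasing classification score until their cumulative probability reaches the threshold. The sketch assumes that the classification scores are normalized probabilities that sum to one, and it assumes, as one hypothetical choice, that the uncertainty measure is the fraction of patient categories included in the trust set. The function names, category names, and threshold value are illustrative only.

```python
def trust_set(class_probabilities, threshold):
    """Smallest set of top-scoring categories whose total probability meets the threshold."""
    ranked = sorted(class_probabilities, key=class_probabilities.get, reverse=True)
    selected, cumulative = [], 0.0
    for category in ranked:
        selected.append(category)
        cumulative += class_probabilities[category]
        if cumulative >= threshold:
            break
    return selected

def uncertainty_measure(class_probabilities, threshold):
    """Assumed measure: fraction of all categories needed to reach the threshold."""
    return len(trust_set(class_probabilities, threshold)) / len(class_probabilities)

# Hypothetical classification scores for three patient categories.
scores = {"category_a": 0.6, "category_b": 0.3, "category_c": 0.1}
patient_trust_set = trust_set(scores, threshold=0.85)
patient_uncertainty = uncertainty_measure(scores, threshold=0.85)
```

With these example scores the trust set contains two of the three categories, so it is a proper subset of the set of patient categories. A smaller trust set indicates a more certain classification under this assumed measure.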

The system generates a clinical recommendation for medical treatment of the patient based on: (i) the patient classification, and (ii) the uncertainty measure that characterizes the uncertainty of the patient classification (2608). For example, the system can evaluate a confidence criterion based at least in part on the uncertainty measure that characterizes the uncertainty of the patient classification. In response to determining that the confidence criterion is satisfied, the system can generate the clinical recommendation for the patient based on the patient classification.
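The gating of step 2608 can be illustrated with the following sketch. It assumes that the confidence criterion is a simple upper bound on the uncertainty measure, that each patient category maps to a treatment through a lookup table, and that no recommendation is produced when the criterion is not satisfied. The function name, the bound of 0.5, and the category and treatment names are all hypothetical.

```python
def clinical_recommendation(patient_category, uncertainty,
                            treatment_by_category, max_uncertainty=0.5):
    """Return a treatment for the category only if the confidence criterion is met."""
    # Assumed confidence criterion: the uncertainty must not exceed a fixed bound.
    if uncertainty <= max_uncertainty:
        return treatment_by_category.get(patient_category)
    # Criterion not satisfied: this sketch produces no recommendation.
    return None

# Hypothetical mapping from patient categories to treatments.
treatments = {"category_a": "treatment_x", "category_b": "treatment_y"}

confident = clinical_recommendation("category_a", 0.25, treatments)
uncertain = clinical_recommendation("category_a", 0.75, treatments)
```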

FIG. 27 is a flow diagram of an example process 2700 for generating a respective response score for each patient category in a set of patient categories. For convenience, the process 2700 will be described as being performed by a system of one or more computers located in one or more locations. For example, a response estimation system, e.g., the response estimation system 1800 of FIG. 18, appropriately programmed in accordance with this specification, can perform the process 2700.

The system generates a drug signature for a drug (2702). The drug signature includes a respective impact score for each of multiple features. The impact score for a feature characterizes an impact, caused by administering a drug to one or more entities, on a value of the feature measured for the one or more entities.
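As one illustrative sketch of step 2702, the following code computes a drug signature from pre-treatment and post-treatment feature representations by forming a differential representation for each entity, element-wise dividing it by the pre-treatment representation, and averaging the resulting entity-specific signatures. The sketch assumes that the difference is taken as post-treatment minus pre-treatment and that all pre-treatment values are nonzero; the function names and example values are hypothetical.

```python
def entity_signature(pre, post):
    """Relative change per feature: (post - pre) / pre, element-wise."""
    # Assumption: pre-treatment values are nonzero, so the division is defined.
    return [(after - before) / before for before, after in zip(pre, post)]

def drug_signature(pre_treatment, post_treatment):
    """Average of the entity-specific signatures, one impact score per feature."""
    per_entity = [
        entity_signature(pre, post)
        for pre, post in zip(pre_treatment, post_treatment)
    ]
    return [sum(column) / len(per_entity) for column in zip(*per_entity)]

# Hypothetical measurements: two entities, two features each.
pre_treatment = [[2.0, 4.0], [1.0, 8.0]]
post_treatment = [[3.0, 4.0], [2.0, 4.0]]
signature = drug_signature(pre_treatment, post_treatment)
```

In this example the first feature increases on average and the second decreases, which is reflected in the signs of the two impact scores.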

The system generates an embedding of the drug signature in a latent space (2704). In particular, the system generates a network input to an encoder neural network based on the drug signature. The system processes the network input generated based on the drug signature using the encoder neural network to generate the embedding of the drug signature in the latent space.

The system processes: (i) the embedding of the drug signature in the latent space, and (ii) data defining a set of patient categories, to generate a set of response scores (2706). Each response score corresponds to a respective patient category and characterizes a predicted response of patients included in the patient category to the drug.
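Where each patient category is defined by a cluster of patient embeddings in the latent space, step 2706 can be sketched as follows. The sketch assumes that cosine similarity is used as the similarity measure, that the response score of a category is the mean similarity between the drug signature embedding and the patient embeddings in the category's cluster, and that categories are ranked by descending score. These choices, the function names, and the example embeddings are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def response_scores(drug_embedding, clusters):
    """Mean similarity between the drug embedding and each cluster's patient embeddings."""
    return {
        category: sum(cosine_similarity(drug_embedding, p) for p in members) / len(members)
        for category, members in clusters.items()
    }

def rank_categories(scores):
    """Patient categories ordered from highest to lowest response score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical latent-space embeddings (two dimensions for readability).
drug_embedding = [1.0, 0.0]
clusters = {
    "category_0": [[1.0, 0.0], [2.0, 0.0]],
    "category_1": [[0.0, 1.0], [0.0, 3.0]],
    "category_2": [[-1.0, 0.0]],
}
scores = response_scores(drug_embedding, clusters)
ranking = rank_categories(scores)
```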

FIGS. 28-33 show examples of experimental results achieved through applying the machine learning system described in this specification to multi-modal data for a population of patients, including at least some patients having amyotrophic lateral sclerosis (ALS). The multi-modal data for the patients included gene expression data and clinical data (e.g., demographic features, family history features, site of onset features, severity features, grip strength features, and respiratory function features).

FIG. 28 shows an example of 12 multi-modal data archetypes (labeled on the horizontal axis as “X0,” “X1,” ..., “X11”) relative to a set of multi-modal features (labeled on the vertical axis as “Clinical_age,” “Clinical_white,” etc.). The shade of each cell shown in FIG. 28 represents an intensity score (e.g., as described with reference to FIGS. 3) of a respective feature in a corresponding archetype.

FIGS. 29A-B show an example of clustering the patients in the population of patients. More specifically, FIG. 29A shows a two-dimensional visualization of the distribution of the patients in the respective clusters, and FIG. 29B shows the number of patients categorized as being included in each cluster.

FIGS. 30A-B and FIGS. 31A-B show examples of the distribution of features for patients within clusters. More specifically, for each of five clusters, FIG. 30A shows a distribution of El Escorial criteria values for patients in the cluster, FIG. 30B shows a distribution of Revised Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS-R) values for patients in the cluster, FIG. 31A shows a distribution of CSNK1D gene expression for patients in the cluster, and FIG. 31B shows a distribution of CSNK1E gene expression for patients in the cluster.

FIGS. 32-33 illustrate that the machine learning system is more likely to identify clusters of patients that differentiate along multiple feature dimensions when the machine learning system processes multi-modal patient data (in this case, a combination of gene expression data and clinical data) instead of unimodal patient data (in this case, gene expression data alone or clinical data alone).

More specifically, FIG. 32A shows a chart that illustrates the extent to which patient clusters identified by the machine learning system differentiate along clinical feature dimensions when the machine learning system processes unimodal patient data, in particular, gene expression data alone. The length of the bar associated with each feature reflects the extent to which clusters can be differentiated with reference to that feature. It will be appreciated that, in this case, the clusters do not differentiate along clinical feature dimensions.

FIG. 32B shows a chart that illustrates the extent to which patient clusters identified by the machine learning system differentiate along clinical feature dimensions when the machine learning system processes multi-modal patient data, in particular, gene expression data and clinical data. The length of the bar associated with each feature reflects the extent to which clusters can be differentiated with reference to that feature. It will be appreciated that, in this case, the clusters are differentiated along many clinical feature dimensions.

FIG. 32C shows a chart that illustrates the extent to which patient clusters identified by the machine learning system differentiate along clinical feature dimensions when the machine learning system processes unimodal patient data, in particular, clinical data alone. The length of the bar associated with each feature reflects the extent to which clusters can be differentiated with reference to that feature. It will be appreciated that, in this case, the clusters are differentiated along only a couple of clinical feature dimensions, in particular, feature dimensions that indicate the site of onset of ALS.

FIG. 33A shows a chart that illustrates the extent to which patient clusters identified by the machine learning system differentiate along gene expression feature dimensions when the machine learning system processes unimodal patient data, in particular, gene expression data alone. The length of the bar associated with each feature reflects the extent to which clusters can be differentiated with reference to that feature. It will be appreciated that, in this case, the clusters are differentiated along many gene expression feature dimensions.

FIG. 33B shows a chart that illustrates the extent to which patient clusters identified by the machine learning system differentiate along gene expression feature dimensions when the machine learning system processes multi-modal patient data, in particular, gene expression data and clinical data. The length of the bar associated with each feature reflects the extent to which clusters can be differentiated with reference to that feature. It will be appreciated that, in this case, the clusters are differentiated along many gene expression feature dimensions.

FIG. 33C shows a chart that illustrates the extent to which patient clusters identified by the machine learning system differentiate along gene expression feature dimensions when the machine learning system processes unimodal patient data, in particular, clinical data alone. The length of the bar associated with each feature reflects the extent to which clusters can be differentiated with reference to that feature. It will be appreciated that, in this case, the clusters do not differentiate along gene expression feature dimensions.

Comparing and contrasting FIGS. 32A-C and FIGS. 33A-C suggests that the machine learning system should process both gene expression data and clinical data in order to identify well-differentiated categories (sub-types) of patients with ALS. More generally, FIGS. 32A-C and FIGS. 33A-C suggest that processing multi-modal patient data (as opposed to unimodal patient data) can enable the machine learning system to stratify patients into well-differentiated clusters. Clusters can be referred to as being “well-differentiated” if they are differentiated along many feature dimensions. Well-differentiated clusters are more likely to represent categories of patients that differentiate in clinically significant and reproducible ways, e.g., such that patients in the same cluster are more likely to share characteristics such as response and/or side effects from receiving a medical treatment, e.g., a drug.

This specification uses the term “configured” in connection with systems and computer program components. For a system of one or more computers to be configured to perform particular operations or actions means that the system has installed on it software, firmware, hardware, or a combination of them that in operation cause the system to perform the operations or actions. For one or more computer programs to be configured to perform particular operations or actions means that the one or more programs include instructions that, when executed by data processing apparatus, cause the apparatus to perform the operations or actions.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly-embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory storage medium for execution by, or to control the operation of, data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them. Alternatively or in addition, the program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus.

The term “data processing apparatus” refers to data processing hardware and encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can optionally include, in addition to hardware, code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program, which may also be referred to or described as a program, software, a software application, an app, a module, a software module, a script, or code, can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages; and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a data communication network.

In this specification the term “engine” is used broadly to refer to a software-based system, subsystem, or process that is programmed to perform one or more specific functions. Generally, an engine will be implemented as one or more software modules or components, installed on one or more computers in one or more locations. In some cases, one or more computers will be dedicated to a particular engine; in other cases, multiple engines can be installed and running on the same computer or computers.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by special purpose logic circuitry, e.g., an FPGA or an ASIC, or by a combination of special purpose logic circuitry and one or more programmed computers.

Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. The central processing unit and the memory can be supplemented by, or incorporated in, special purpose logic circuitry. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user’s device in response to requests received from the web browser. Also, a computer can interact with a user by sending text messages or other forms of message to a personal device, e.g., a smartphone that is running a messaging application, and receiving responsive messages from the user in return.

Data processing apparatus for implementing machine learning models can also include, for example, special-purpose hardware accelerator units for processing common and compute-intensive parts of machine learning training or production, i.e., inference, workloads.

Machine learning models can be implemented and deployed using any appropriate machine learning framework.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface, a web browser, or an app through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN) and a wide area network (WAN), e.g., the Internet.

The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. In some embodiments, a server transmits data, e.g., an HTML page, to a user device, e.g., for purposes of displaying data to and receiving user input from a user interacting with the device, which acts as a client. Data generated at the user device, e.g., a result of the user interaction, can be received at the server from the device.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially be claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.

Similarly, while operations are depicted in the drawings and recited in the claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous.

What is claimed is:
 1. A method performed by one or more computers, the method comprising: generating a drug signature for a drug, wherein: the drug signature comprises a respective impact score for each of a plurality of features; and the impact score for a feature characterizes an impact, caused by administering a drug to one or more entities, on a value of the feature measured for the one or more entities; generating an embedding of the drug signature in a latent space, comprising: generating a network input to an encoder neural network based on the drug signature; and processing the network input generated based on the drug signature using the encoder neural network to generate the embedding of the drug signature in the latent space; and processing: (i) the embedding of the drug signature in the latent space, and (ii) data defining a plurality of patient categories, to generate a plurality of response scores, wherein each response score corresponds to a respective patient category and characterizes a predicted response of patients included in the patient category to the drug.
 2. The method of claim 1, wherein generating the drug signature comprises: obtaining, for each of the entities: (i) a pre-treatment feature representation of the entity that comprises, for each of the plurality of features, a respective pre-treatment value of the feature that is measured for the entity prior to the drug being administered to the entity; and (ii) a post-treatment feature representation of the entity that comprises, for each of the plurality of features, a respective post-treatment value of the feature that is measured for the entity after the drug is administered to the entity; and generating the drug signature based on the pre-treatment and post-treatment feature representations of the entities.
 3. The method of claim 2, wherein generating the drug signature based on the pre-treatment and post-treatment feature representations of the entities comprises: generating, for each of the plurality of entities, a respective differential feature representation of the entity based on a difference between: (i) the pre-treatment feature representation of the entity, and (ii) the post-treatment feature representation of the entity; and generating the drug signature based on the differential feature representations of the entities.
 4. The method of claim 3, wherein generating the drug signature based on the differential feature representations of the entities comprises: generating a respective entity-specific drug signature for each of the entities based on the differential feature representation of the entity; and generating the drug signature by combining the entity-specific drug signatures.
 5. The method of claim 4, wherein for each of the entities, generating the entity-specific drug signature for the entity comprises: element-wise dividing the differential feature representation for the entity by the pre-treatment feature representation of the entity.
 6. The method of claim 4, wherein generating the drug signature by combining the entity-specific drug signatures comprises: averaging the entity-specific drug signatures.
 7. The method of claim 1, wherein the drug signature comprises one or more impact scores that each characterize an impact, caused by administering the drug to the one or more entities, on a level of expression of a respective gene in the one or more entities.
 8. The method of claim 1, wherein the drug signature comprises one or more impact scores that each characterize an impact, caused by administering the drug to the one or more entities, on a level of expression of a respective protein in the one or more entities.
 9. The method of claim 1, wherein the network input to the encoder neural network includes the drug signature.
 10. The method of claim 1, wherein each of the plurality of patient categories is defined by a cluster of patient embeddings in the latent space, wherein each patient embedding corresponds to a respective patient and is generated by processing multi-modal data characterizing the patient using the encoder neural network.
 11. The method of claim 10, wherein for each of the plurality of patient categories, generating the response score for the patient category comprises: determining a respective similarity measure between: (i) the embedding of the drug signature, and (ii) each of one or more patient embeddings in the cluster of patient embeddings defining the patient category; and determining the response score for the patient category based on the similarity measures.
 12. The method of claim 1, further comprising determining a ranking of the plurality of patient categories based on the response scores.
 13. The method of claim 1, further comprising: determining that a new patient is included in a patient category of the plurality of patient categories; identifying the response score for the patient category of the new patient; and automatically generating a recommendation for whether the new patient should receive the drug based at least in part on the response score for the patient category of the new patient.
 14. The method of claim 1, wherein each of the one or more entities comprises a cell.
 15. The method of claim 1, wherein each of the one or more entities comprises a collection of cells.
 16. The method of claim 1, wherein each of the one or more entities is a patient.
 17. The method of claim 1, wherein the encoder neural network has been trained to process multi-modal data characterizing patients.
 18. The method of claim 17, wherein the encoder neural network has been trained by operations comprising: obtaining a plurality of training examples, wherein each training example corresponds to a respective patient and includes multi-modal data that characterizes the patient; jointly training the encoder neural network along with a decoder neural network on the plurality of training examples, comprising, for each training example: processing the multi-modal data from the training example using the encoder neural network to generate an embedding of the multi-modal data from the training example; processing the embedding of the multi-modal data from the training example using the decoder neural network to generate a reconstruction of the multi-modal data from the training example; and updating current values of a set of encoder parameters and current values of a set of decoder parameters using gradients of a reconstruction loss function that measures an error in the reconstruction of the multi-modal data from the training example.
 19. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: generating a drug signature for a drug, wherein: the drug signature comprises a respective impact score for each of a plurality of features; and the impact score for a feature characterizes an impact, caused by administering a drug to one or more entities, on a value of the feature measured for the one or more entities; generating an embedding of the drug signature in a latent space, comprising: generating a network input to an encoder neural network based on the drug signature; and processing the network input generated based on the drug signature using the encoder neural network to generate the embedding of the drug signature in the latent space; and processing: (i) the embedding of the drug signature in the latent space, and (ii) data defining a plurality of patient categories, to generate a plurality of response scores, wherein each response score corresponds to a respective patient category and characterizes a predicted response of patients included in the patient category to the drug.
 20. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: generating a drug signature for a drug, wherein: the drug signature comprises a respective impact score for each of a plurality of features; and the impact score for a feature characterizes an impact, caused by administering a drug to one or more entities, on a value of the feature measured for the one or more entities; generating an embedding of the drug signature in a latent space, comprising: generating a network input to an encoder neural network based on the drug signature; and processing the network input generated based on the drug signature using the encoder neural network to generate the embedding of the drug signature in the latent space; and processing: (i) the embedding of the drug signature in the latent space, and (ii) data defining a plurality of patient categories, to generate a plurality of response scores, wherein each response score corresponds to a respective patient category and characterizes a predicted response of patients included in the patient category to the drug.